pith. machine review for the scientific record.

arxiv: 2604.06098 · v2 · submitted 2026-04-07 · 💻 cs.IR · cs.CL

Recognition: no theorem link

JU'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:25 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords information retrieval · legal IR · benchmark · Brazilian Portuguese · jurisprudence · domain adaptation · BM25 · evaluation framework

The pith

JU'A is a benchmark designed to enable reproducible and comparable evaluation of information retrieval methods across heterogeneous Brazilian legal collections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Legal information retrieval in Portuguese has been difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. The paper introduces JU'A as a public benchmark and ongoing evaluation infrastructure that unifies multiple Brazilian legal collections under shared protocols, common ranking metrics, fixed splits where applicable, and a public leaderboard. It covers jurisprudence retrieval along with legislative, regulatory, and question-driven legal search. Evaluations of lexical, dense, and BM25-based reranking pipelines, including a domain-adapted embedding model, show that the benchmark distinguishes retrieval paradigms and exposes cross-dataset trade-offs, with adaptation helping most on aligned subsets and BM25 staying competitive elsewhere. A reader would care because the benchmark turns scattered legal datasets into a stable testbed for developing better search tools that lawyers and citizens rely on to locate relevant Brazilian laws and cases.
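The "common ranking metrics" the shared protocols standardize are standard graded-relevance measures. NDCG@10, the headline metric in this literature, reduces to a few lines; the sketch below is illustrative, not the paper's evaluation code:

```python
import math

def dcg_at_k(gains, k=10):
    # Discounted cumulative gain over the top-k ranked relevance grades.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of an ideal (descending) ordering of the same grades.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the documents a system returned, in rank order.
print(round(ndcg_at_k([3, 2, 0, 1], k=10), 3))  # → 0.985
```

A perfect ranking scores exactly 1.0, and a query with no relevant documents is conventionally scored 0.0, which is what the zero-ideal guard handles.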

Core claim

JU'A is presented as both a benchmark and a continuous evaluation infrastructure for Brazilian legal IR that combines shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. Evaluations of lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JU'A-aligned supervision, demonstrate that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs, with domain adaptation yielding its clearest gains on the supervision-aligned JU'A-Juris subset while BM25 remains highly competitive on other collections, especially where lexical and institutional phrasing cues are strong.

What carries the argument

JU'A, the public benchmark and evaluation infrastructure that standardizes protocols, metrics, and splits across multiple Brazilian legal collections to support comparable testing of retrieval methods.

Load-bearing premise

The chosen collections, query styles, and relevance judgments are representative enough of real Brazilian legal search needs that performance on JU'A will predict performance on new, unseen legal queries and documents.

What would settle it

A new, independently collected set of Brazilian legal documents and queries that produces retrieval performance patterns contradicting those on JU'A, such as one paradigm outperforming all others uniformly, would falsify the benchmark's ability to generalize.

read the original abstract

Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present JU'A, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, JU'A is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JU'A-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JU'A-Juris subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, JU'A provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces JU'A, a public benchmark and continuous evaluation infrastructure for information retrieval over Brazilian legal collections spanning jurisprudence, legislation, regulation, and question-driven search. It evaluates lexical (BM25), dense, and reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JU'A-aligned supervision. Results indicate the benchmark is heterogeneous enough to distinguish retrieval paradigms and reveal cross-dataset trade-offs, with domain adaptation showing its clearest gains on the supervision-aligned JU'A-Juris subset while BM25 remains competitive on other collections.

Significance. If the data splits and relevance judgments are sound, JU'A fills a gap in reproducible Portuguese legal IR evaluation by providing shared protocols, fixed splits, and a public leaderboard. The empirical demonstration of paradigm differentiation and BM25 competitiveness on heterogeneous legal data is a useful contribution for practical system design.
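The BM25 baseline whose competitiveness the report highlights is the Robertson-Zaragoza scoring function. A self-contained sketch over toy Portuguese legal tokens (illustrative data, not JU'A documents) shows why strong lexical cues favor it:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    # docs: list of tokenized documents; returns one BM25 score per document.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the collection contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores

docs = [["habeas", "corpus", "prazo"],
        ["prazo", "recursal", "civil"],
        ["direito", "penal"]]
print(bm25_scores(["prazo", "recursal"], docs))
```

The defaults k1=1.2 and b=0.75 are the conventional starting values; production systems typically tune them per collection.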

major comments (2)
  1. [§4.2 (Fine-tuning details)] The description states the Qwen model was fine-tuned on 'JU'A-aligned supervision' and reports clearest gains on the 'supervision-aligned JU'A-Juris subset,' but does not explicitly confirm that the fine-tuning corpus is strictly disjoint from the JU'A-Juris test queries, documents, and judgments. Any overlap would mean the reported lift reflects leakage rather than adaptation, directly threatening the internal validity of the domain-adaptation and cross-dataset trade-off claims.
  2. [§5 (Experimental results)] The reported performance differences across lexical, dense, and reranking methods lack statistical significance tests or confidence intervals. For small legal query sets this weakens the claim that the benchmark 'distinguishes retrieval paradigms' and reveals 'substantial cross-dataset trade-offs.'
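The second major comment is cheap to address once per-query scores are retained: a paired bootstrap confidence interval needs only the standard library (scipy.stats.wilcoxon would supply the signed-rank test proper). A minimal sketch with hypothetical per-query NDCG@10 values, not numbers from the paper:

```python
import random
import statistics

def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    # 95% percentile bootstrap CI for the mean per-query difference
    # between two systems scored on the same queries.
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-query NDCG@10 for two systems on eight shared queries.
sys_a = [0.61, 0.42, 0.75, 0.50, 0.68, 0.33, 0.59, 0.71]
sys_b = [0.55, 0.40, 0.70, 0.52, 0.60, 0.30, 0.58, 0.65]
lo, hi = paired_bootstrap_ci(sys_a, sys_b)
print(f"mean difference CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the paradigm-level difference on that collection is unlikely to be resampling noise; with the small query sets some JU'A subsets appear to have, intervals this wide are exactly what the referee is asking the authors to report.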
minor comments (2)
  1. [Abstract] The escaped form 'JU\'A' in the abstract and title should be rendered consistently as JU'A throughout.
  2. [§3 (Benchmark construction)] The description of relevance judgment collection could be expanded with inter-annotator agreement statistics to support the claim of reliable labels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, confirming where revisions will be made to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§4.2 (Fine-tuning details)] The description states the Qwen model was fine-tuned on 'JU'A-aligned supervision' and reports clearest gains on the 'supervision-aligned JU'A-Juris subset,' but does not explicitly confirm that the fine-tuning corpus is strictly disjoint from the JU'A-Juris test queries, documents, and judgments. Any overlap would mean the reported lift reflects leakage rather than adaptation, directly threatening the internal validity of the domain-adaptation and cross-dataset trade-off claims.

    Authors: We appreciate the referee's emphasis on this critical validity concern. The JU'A-aligned supervision used for fine-tuning was derived exclusively from training splits and external aligned sources that have no overlap with the test queries, documents, or relevance judgments in the JU'A-Juris evaluation subset. We will revise §4.2 to include an explicit statement confirming this disjointness, which will directly address the internal validity of the domain-adaptation results and the reported cross-dataset trade-offs. revision: yes

  2. Referee: [§5 (Experimental results)] The reported performance differences across lexical, dense, and reranking methods lack statistical significance tests or confidence intervals. For small legal query sets this weakens the claim that the benchmark 'distinguishes retrieval paradigms' and reveals 'substantial cross-dataset trade-offs.'

    Authors: We agree that the lack of statistical tests and confidence intervals weakens the strength of our claims about paradigm differentiation and cross-dataset trade-offs, especially with smaller query sets in some collections. In the revised manuscript we will add bootstrap confidence intervals around all reported metrics and include paired statistical tests (e.g., Wilcoxon signed-rank) to assess the significance of performance differences. These additions will provide a more rigorous foundation for the empirical observations while remaining feasible given the benchmark design. revision: yes
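The disjointness guarantee the authors commit to documenting in §4.2 is also cheap to enforce mechanically at benchmark-build time, so the revised statement could be backed by a check rather than an assertion. A sketch (the identifiers below are hypothetical, not JU'A's actual ID scheme):

```python
def assert_disjoint(train_ids, test_ids):
    # Raise if any fine-tuning example shares an identifier with the held-out split.
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(
            f"leakage: {len(overlap)} ids in both splits, e.g. {sorted(overlap)[:5]}"
        )
    return True

# Hypothetical query/document identifiers for the two splits.
assert_disjoint({"q001", "q002", "d010"}, {"q003", "d011"})  # passes silently
```

Run over query IDs, document IDs, and judgment pairs separately, a check like this turns the leakage question from a reviewer concern into a reproducible property of the release.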

Circularity Check

0 steps flagged

No circularity in empirical benchmark evaluation

full rationale

The paper introduces JU'A as a new benchmark collection with fixed splits and protocols, then reports direct empirical results from running existing lexical/dense methods plus one fine-tuned Qwen model on those collections. No mathematical derivation, uniqueness theorem, or 'prediction' is claimed that reduces by construction to fitted parameters or self-citations; performance numbers are measured outputs on the stated test partitions. Domain-adaptation gains are reported on the supervision-aligned subset under standard train/test separation, and cross-dataset trade-offs are observed comparisons rather than self-referential loops. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard IR evaluation practices and existing embedding models; no new mathematical axioms, free parameters fitted to the target result, or invented entities are introduced beyond the benchmark construction itself.

pith-pipeline@v0.9.0 · 5512 in / 1123 out tokens · 40364 ms · 2026-05-10T18:25:06.486510+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Domain-Adaptive Dense Retrieval for Brazilian Legal Search

    cs.IR 2026-05 unverdicted novelty 4.0

    Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with ...

Reference graph

Works this paper leans on

28 extracted references · 19 canonical work pages · cited by 1 Pith paper · 3 internal anchors
