pith. machine review for the scientific record.

arxiv: 2604.06098 · v2 · submitted 2026-04-07 · 💻 cs.IR · cs.CL

Recognition: no theorem link

JU'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:25 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords information retrieval · legal IR · benchmark · Brazilian Portuguese · jurisprudence · domain adaptation · BM25 · evaluation framework

The pith

JU'A is a benchmark designed to enable reproducible and comparable evaluation of information retrieval methods across heterogeneous Brazilian legal collections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Legal information retrieval in Portuguese has been difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. The paper introduces JU'A as a public benchmark and ongoing evaluation infrastructure that unifies multiple Brazilian legal collections under shared protocols, common ranking metrics, fixed splits where applicable, and a public leaderboard. It covers jurisprudence retrieval along with legislative, regulatory, and question-driven legal search. Evaluations of lexical, dense, and BM25-based reranking pipelines, including a domain-adapted embedding model, show that the benchmark distinguishes retrieval paradigms and exposes cross-dataset trade-offs, with adaptation helping most on aligned subsets and BM25 staying competitive elsewhere. A reader would care because the benchmark turns scattered legal datasets into a stable testbed for developing better search tools that lawyers and citizens rely on to locate relevant Brazilian laws and cases.
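The "common ranking metrics" the shared protocols standardize are standard graded-relevance measures. NDCG@10, the headline metric in this literature, reduces to a few lines; the sketch below is illustrative, not the paper's evaluation code:

```python
import math

def dcg_at_k(gains, k=10):
    # Discounted cumulative gain over the top-k ranked relevance grades.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of an ideal (descending) ordering of the same grades.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the documents a system returned, in rank order.
print(round(ndcg_at_k([3, 2, 0, 1], k=10), 3))  # → 0.985
```

A perfect ranking scores exactly 1.0, and a query with no relevant documents is conventionally scored 0.0, which is what the zero-ideal guard handles.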

Core claim

JU'A is presented as both a benchmark and a continuous evaluation infrastructure for Brazilian legal IR that combines shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. Evaluations of lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JU'A-aligned supervision, demonstrate that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs, with domain adaptation yielding its clearest gains on the supervision-aligned JU'A-Juris subset while BM25 remains highly competitive on other collections, especially where lexical and institutional phrasing cues are strong.

What carries the argument

JU'A, the public benchmark and evaluation infrastructure that standardizes protocols, metrics, and splits across multiple Brazilian legal collections to support comparable testing of retrieval methods.

Load-bearing premise

The chosen collections, query styles, and relevance judgments are representative enough of real Brazilian legal search needs that performance on JU'A will predict performance on new, unseen legal queries and documents.

What would settle it

A new, independently collected set of Brazilian legal documents and queries that produces retrieval performance patterns contradicting those on JU'A, such as one paradigm outperforming all others uniformly, would falsify the benchmark's ability to generalize.

read the original abstract

Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present JU'A, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, JU'A is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JU'A-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JU'A-Juris subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, JU'A provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces JU'A, a public benchmark and continuous evaluation infrastructure for information retrieval over Brazilian legal collections spanning jurisprudence, legislation, regulation, and question-driven search. It evaluates lexical (BM25), dense, and reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JU'A-aligned supervision. Results indicate the benchmark is heterogeneous enough to distinguish retrieval paradigms and reveal cross-dataset trade-offs, with domain adaptation showing its clearest gains on the supervision-aligned JU'A-Juris subset while BM25 remains competitive on other collections.

Significance. If the data splits and relevance judgments are sound, JU'A fills a gap in reproducible Portuguese legal IR evaluation by providing shared protocols, fixed splits, and a public leaderboard. The empirical demonstration of paradigm differentiation and BM25 competitiveness on heterogeneous legal data is a useful contribution for practical system design.
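The BM25 baseline whose competitiveness the report highlights is the Robertson-Zaragoza scoring function. A self-contained sketch over toy Portuguese legal tokens (illustrative data, not JU'A documents) shows why strong lexical cues favor it:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    # docs: list of tokenized documents; returns one BM25 score per document.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the collection contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores

docs = [["habeas", "corpus", "prazo"],
        ["prazo", "recursal", "civil"],
        ["direito", "penal"]]
print(bm25_scores(["prazo", "recursal"], docs))
```

The defaults k1=1.2 and b=0.75 are the conventional starting values; production systems typically tune them per collection.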

major comments (2)
  1. [§4.2 (Fine-tuning details)] The description states the Qwen model was fine-tuned on 'JU'A-aligned supervision' and reports clearest gains on the 'supervision-aligned JU'A-Juris subset,' but does not explicitly confirm that the fine-tuning corpus is strictly disjoint from the JU'A-Juris test queries, documents, and judgments. Any overlap would mean the reported lift reflects leakage rather than adaptation, directly threatening the internal validity of the domain-adaptation and cross-dataset trade-off claims.
  2. [§5 (Experimental results)] The reported performance differences across lexical, dense, and reranking methods lack statistical significance tests or confidence intervals. For small legal query sets this weakens the claim that the benchmark 'distinguishes retrieval paradigms' and reveals 'substantial cross-dataset trade-offs.'
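The second major comment is cheap to address once per-query scores are retained: a paired bootstrap confidence interval needs only the standard library (scipy.stats.wilcoxon would supply the signed-rank test proper). A minimal sketch with hypothetical per-query NDCG@10 values, not numbers from the paper:

```python
import random
import statistics

def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    # 95% percentile bootstrap CI for the mean per-query difference
    # between two systems scored on the same queries.
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-query NDCG@10 for two systems on eight shared queries.
sys_a = [0.61, 0.42, 0.75, 0.50, 0.68, 0.33, 0.59, 0.71]
sys_b = [0.55, 0.40, 0.70, 0.52, 0.60, 0.30, 0.58, 0.65]
lo, hi = paired_bootstrap_ci(sys_a, sys_b)
print(f"mean difference CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the paradigm-level difference on that collection is unlikely to be resampling noise; with the small query sets some JU'A subsets appear to have, intervals this wide are exactly what the referee is asking the authors to report.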
minor comments (2)
  1. [Abstract] The escaped form 'JU\'A' in the abstract and title should be rendered consistently as JU'A throughout.
  2. [§3 (Benchmark construction)] The description of relevance judgment collection could be expanded with inter-annotator agreement statistics to support the claim of reliable labels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, confirming where revisions will be made to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§4.2 (Fine-tuning details)] The description states the Qwen model was fine-tuned on 'JU'A-aligned supervision' and reports clearest gains on the 'supervision-aligned JU'A-Juris subset,' but does not explicitly confirm that the fine-tuning corpus is strictly disjoint from the JU'A-Juris test queries, documents, and judgments. Any overlap would mean the reported lift reflects leakage rather than adaptation, directly threatening the internal validity of the domain-adaptation and cross-dataset trade-off claims.

    Authors: We appreciate the referee's emphasis on this critical validity concern. The JU'A-aligned supervision used for fine-tuning was derived exclusively from training splits and external aligned sources that have no overlap with the test queries, documents, or relevance judgments in the JU'A-Juris evaluation subset. We will revise §4.2 to include an explicit statement confirming this disjointness, which will directly address the internal validity of the domain-adaptation results and the reported cross-dataset trade-offs. revision: yes

  2. Referee: [§5 (Experimental results)] The reported performance differences across lexical, dense, and reranking methods lack statistical significance tests or confidence intervals. For small legal query sets this weakens the claim that the benchmark 'distinguishes retrieval paradigms' and reveals 'substantial cross-dataset trade-offs.'

    Authors: We agree that the lack of statistical tests and confidence intervals weakens the strength of our claims about paradigm differentiation and cross-dataset trade-offs, especially with smaller query sets in some collections. In the revised manuscript we will add bootstrap confidence intervals around all reported metrics and include paired statistical tests (e.g., Wilcoxon signed-rank) to assess the significance of performance differences. These additions will provide a more rigorous foundation for the empirical observations while remaining feasible given the benchmark design. revision: yes
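The disjointness guarantee the authors commit to documenting in §4.2 is also cheap to enforce mechanically at benchmark-build time, so the revised statement could be backed by a check rather than an assertion. A sketch (the identifiers below are hypothetical, not JU'A's actual ID scheme):

```python
def assert_disjoint(train_ids, test_ids):
    # Raise if any fine-tuning example shares an identifier with the held-out split.
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(
            f"leakage: {len(overlap)} ids in both splits, e.g. {sorted(overlap)[:5]}"
        )
    return True

# Hypothetical query/document identifiers for the two splits.
assert_disjoint({"q001", "q002", "d010"}, {"q003", "d011"})  # passes silently
```

Run over query IDs, document IDs, and judgment pairs separately, a check like this turns the leakage question from a reviewer concern into a reproducible property of the release.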

Circularity Check

0 steps flagged

No circularity in empirical benchmark evaluation

full rationale

The paper introduces JU'A as a new benchmark collection with fixed splits and protocols, then reports direct empirical results from running existing lexical/dense methods plus one fine-tuned Qwen model on those collections. No mathematical derivation, uniqueness theorem, or 'prediction' is claimed that reduces by construction to fitted parameters or self-citations; performance numbers are measured outputs on the stated test partitions. Domain-adaptation gains are reported on the supervision-aligned subset under standard train/test separation, and cross-dataset trade-offs are observed comparisons rather than self-referential loops. The work is therefore self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard IR evaluation practices and existing embedding models; no new mathematical axioms, free parameters fitted to the target result, or invented entities are introduced beyond the benchmark construction itself.

pith-pipeline@v0.9.0 · 5512 in / 1123 out tokens · 40364 ms · 2026-05-10T18:25:06.486510+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Domain-Adaptive Dense Retrieval for Brazilian Legal Search

    cs.IR 2026-05 unverdicted novelty 4.0

    Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with ...

Reference graph

Works this paper leans on

28 extracted references · 19 canonical work pages · cited by 1 Pith paper · 3 internal anchors
