pith. machine review for the scientific record.

arxiv: 2604.16337 · v1 · submitted 2026-03-13 · 💻 cs.IR · cs.AI · cs.CY

Recognition: no theorem link

HR-Agents: Using Multiple LLM-based Agents to Improve Q&A about Brazilian Labor Legislation

Authors on Pith no claims yet

Pith reviewed 2026-05-15 11:44 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CY
keywords multi-agent systems · large language models · retrieval-augmented generation · Brazilian labor law · question answering · HR compliance · CrewAI · legal assistance

The pith

A multi-agent LLM system improves response coherence and correctness for Brazilian labor law questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multi-agent system using large language models to answer questions about Brazilian labor legislation, specifically the CLT framework. Specialized agents handle different aspects of employment law and collaborate through CrewAI while incorporating retrieval-augmented generation for better context. This setup is tested against a standard single-LLM RAG approach using metrics like BLEU scores, LLM-as-judge ratings, and some expert reviews. A sympathetic reader would care because it addresses real inefficiencies in HR compliance work, where complex rules often cause delays and errors. If effective, it offers a practical way to make legal Q&A faster and more reliable.

Core claim

The authors establish that deploying multiple LLM-based agents, each focused on distinct tasks such as retrieval, analysis, and validation, within a RAG-enhanced framework leads to improved coherence and correctness in responses to queries about Brazilian labor law when compared to a single LLM baseline.

What carries the argument

A CrewAI-orchestrated multi-agent architecture where specialized LLM agents cooperate on aspects of labor law Q&A, integrated with RAG for contextual relevance.
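The retrieval, analysis, and validation division of labor described above can be sketched without the CrewAI dependency. The roles, instructions, and the `llm`/`retrieve` callables below are illustrative placeholders, not the paper's actual agents or prompts:

```python
# Library-free sketch of a retrieval -> analysis -> validation agent chain.
# Roles, instructions, and the llm/retrieve callables are placeholders,
# not the paper's CrewAI configuration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    role: str
    instruction: str
    llm: Callable[[str], str]  # any text-in/text-out model

    def run(self, context: str) -> str:
        return self.llm(f"[{self.role}] {self.instruction}\n\n{context}")

def answer(question: str,
           retrieve: Callable[[str], str],  # RAG retriever over the CLT corpus
           llm: Callable[[str], str]) -> str:
    """Each agent consumes the previous agent's output, so the validator
    sees both the question and the grounded draft."""
    retriever = Agent("retriever", "Select the CLT passages relevant to the question.", llm)
    analyst = Agent("analyst", "Draft an answer grounded only in the passages.", llm)
    validator = Agent("validator", "Check the draft against the passages and fix errors.", llm)

    passages = retriever.run(f"Question: {question}\nCandidates:\n{retrieve(question)}")
    draft = analyst.run(f"Question: {question}\nPassages:\n{passages}")
    return validator.run(f"Question: {question}\nDraft:\n{draft}")
```

The chaining is the point: the validation agent only adds value over a single-LLM pipeline if it receives enough context (question, passages, draft) to catch grounding errors.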

Load-bearing premise

Automated metrics like BLEU and LLM-as-judge, plus limited expert input, reliably indicate legal accuracy and usefulness for actual HR professionals dealing with live questions.
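This premise is easy to probe for BLEU in particular: the metric rewards n-gram overlap, so a legally accurate paraphrase can score far below a near-verbatim answer. A toy illustration with a simplified sentence-level BLEU (unigram plus bigram, brevity penalty); the sentences are invented, not from the paper's dataset:

```python
# Simplified sentence-level BLEU (up to bigrams) to show that a correct
# paraphrase can score far below a near-verbatim answer. Toy sentences only.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[g]) for g, count in c.items())  # clipped counts
        total = max(sum(c.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero overlaps
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "daily working hours may be extended by at most two overtime hours"
literal = "daily working hours may be extended by at most two extra hours"
paraphrase = "the workday can grow by no more than two additional hours"

print(bleu(literal, ref) > bleu(paraphrase, ref))  # True: near-verbatim wins
```

Both candidates state the same rule, yet the paraphrase shares almost no n-grams with the reference, which is exactly the failure mode the premise has to survive.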

What would settle it

Direct comparison of error rates in responses when HR professionals apply the system versus the baseline in real compliance cases.

Figures

Figures reproduced from arXiv: 2604.16337 by Abriel K. Moraes, Amparo Munoz, Charles S. Oliveira, Erik Soares, Fabiana C. Q. de O. Marucci, Gabriel S. M. Dias, Gabriel U. Talasso, Leonardo R. do Nascimento, Leonardo T. dos Santos, Lucas D. Gessoni, Maria L. A. de S. Cruvinel, Matheus H. R. Vicente, Renata De Paris, Sildolfo Gomes, Vitor G. C. B. de Farias, Vitor L. Fabris, Wandemberg Gibaut.

Figure 3. GPT-4o Answer Similarity
Figure 1. GPT-4o BLEU evaluation
Figure 2. Llama 3.1 8B BLEU evaluation. Adjoining text from the paper: "Figures 3 and 4 display the answer similarity scores, indicating that GPT-4o outperformed Llama within the CrewAI agent configuration, although both models demonstrated satisfactory lexical alignment with the ground truth across both RAG and CrewAI settings. These results indicate that, in the tests conducted, both models were generally capable of producing responses contain…"
Figure 5. GPT-4o Answer Correctness results
Figure 6. Llama 3.1 8B Answer Correctness results. Summary statistics (Precision / Quality): mean 6.47 / 9.75; median 8.00 / 10.00; std 3.43 / 0.78; var 11.79 / 0.61; min 0.00 / 6.00; max 10.00 / 10.00; range 10.00 / 4.00; cv (%) 53.06 / 8.05
Original abstract

The Consolidation of Labor Laws (CLT) serves as the primary legal framework governing labor relations in Brazil, ensuring essential protections for workers. However, its complexity creates challenges for Human Resources (HR) professionals in navigating regulations and ensuring compliance. Traditional methods for addressing labor law inquiries often lead to inefficiencies, delays, and inconsistencies. To enhance the accuracy and efficiency of legal question-answering (Q&A), a multi-agent system powered by Large Language Models (LLMs) is introduced. This approach employs specialized agents to address distinct aspects of employment law while integrating Retrieval-Augmented Generation (RAG) to enhance contextual relevance. Implemented using CrewAI, the system enables cooperative agent interactions, ensuring response validation and reducing misinformation. The effectiveness of this framework is evaluated through a comparison with a baseline RAG pipeline utilizing a single LLM, using automated metrics such as BLEU, LLM-as-judge evaluations, and expert human assessments. Results indicate that the multi-agent approach improves response coherence and correctness, providing a more reliable and efficient solution for HR professionals. This study contributes to AI-driven legal assistance by demonstrating the potential of multi-agent LLM architectures in improving labor law compliance and streamlining HR operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents HR-Agents, a multi-agent LLM system implemented with CrewAI for Q&A on Brazilian labor legislation (CLT). Specialized agents handle distinct aspects of employment law, integrated with RAG for contextual relevance and response validation. It is evaluated against a single-LLM RAG baseline using BLEU, LLM-as-judge, and expert human assessments, claiming improvements in coherence and correctness for HR professionals.

Significance. If the evaluation metrics reliably indicate reduced risk of incorrect legal advice, the multi-agent framework could significantly aid HR professionals in complying with complex Brazilian labor laws, offering efficiency gains over traditional methods. The work demonstrates the potential of cooperative LLM agents in legal domains, though the practical impact hinges on the validity of the chosen evaluation methods for high-stakes accuracy.

major comments (3)
  1. [Abstract] The abstract reports metric improvements using BLEU, LLM-as-judge evaluations, and expert assessments but provides no details on dataset size, question difficulty distribution, statistical significance, or potential confounds in the human evaluation. This omission is load-bearing because it prevents assessment of whether the claimed gains in coherence and correctness translate to lower compliance risk for HR professionals.
  2. [Evaluation] BLEU is employed as a primary automated metric, yet it penalizes valid paraphrases of legal text while rewarding superficial n-gram overlap; this choice undermines the claim of improved correctness for CLT questions (e.g., overtime rules or threshold values) without additional validation that BLEU tracks legal accuracy.
  3. [Evaluation] LLM-as-judge ratings and limited expert assessments lack reported inter-rater reliability, question sampling frame, or error typology (e.g., misstated regulations). Without these, the results do not establish that the multi-agent outputs are materially safer than the single-LLM RAG baseline for live HR compliance queries.
minor comments (2)
  1. [Abstract] The abstract could more explicitly state the number and specialized roles of the agents to improve immediate clarity for readers.
  2. Consider including a summary table of all quantitative metric values (BLEU, LLM-judge scores, human ratings) with confidence intervals to facilitate direct comparison with the baseline.
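The second minor comment's request can be sketched as a mean paired difference (multi-agent score minus baseline score per question) with a normal-approximation 95% interval. The scores below are invented for illustration, and at small n a t-interval would be more appropriate than the z multiplier used here:

```python
# Sketch of a paired comparison with a confidence interval.
# Per-question scores are made up; not the paper's data.
import statistics as st

multi_agent = [8.0, 9.5, 7.0, 9.0, 8.5, 9.0, 7.5, 8.0]
baseline =    [7.5, 8.0, 7.0, 8.5, 7.0, 8.5, 6.0, 7.5]

diffs = [m - b for m, b in zip(multi_agent, baseline)]
mean = st.mean(diffs)
sem = st.stdev(diffs) / len(diffs) ** 0.5        # standard error of the mean
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem    # z approx; small n warrants t
print(f"mean paired difference {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Pairing by question is the design choice that matters: it removes per-question difficulty variance that would otherwise swamp a between-system comparison on only a few dozen queries.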

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We provide point-by-point responses to the major comments and indicate the revisions we will implement in the next version of the paper.

Point-by-point responses
  1. Referee: [Abstract] The abstract reports metric improvements using BLEU, LLM-as-judge evaluations, and expert assessments but provides no details on dataset size, question difficulty distribution, statistical significance, or potential confounds in the human evaluation. This omission is load-bearing because it prevents assessment of whether the claimed gains in coherence and correctness translate to lower compliance risk for HR professionals.

    Authors: We agree that the abstract should provide more context on the evaluation setup. In the revised version, we will include details on the dataset comprising 50 questions with a balanced distribution across difficulty levels and CLT topics, report statistical significance tests (e.g., paired t-tests with p-values), and outline the human evaluation process involving three HR experts to address potential confounds such as evaluator bias. revision: yes

  2. Referee: [Evaluation] BLEU is employed as a primary automated metric, yet it penalizes valid paraphrases of legal text while rewarding superficial n-gram overlap; this choice undermines the claim of improved correctness for CLT questions (e.g., overtime rules or threshold values) without additional validation that BLEU tracks legal accuracy.

    Authors: We recognize the shortcomings of BLEU in evaluating legal Q&A where paraphrasing is common and accuracy is semantic rather than lexical. Although BLEU served as a supplementary automated metric, our conclusions primarily draw from LLM-as-judge and expert assessments. We will revise the evaluation section to explicitly discuss BLEU's limitations in this domain and incorporate additional metrics like BERTScore for better alignment with legal correctness. Examples of responses will be added to illustrate cases where lower BLEU scores still correspond to accurate legal advice. revision: partial

  3. Referee: [Evaluation] LLM-as-judge ratings and limited expert assessments lack reported inter-rater reliability, question sampling frame, or error typology (e.g., misstated regulations). Without these, the results do not establish that the multi-agent outputs are materially safer than the single-LLM RAG baseline for live HR compliance queries.

    Authors: We agree that these details are essential for robust claims. The revised manuscript will report inter-rater reliability using Cohen's kappa for the expert assessments, specify the question sampling as a stratified random sample of 50 queries from a larger pool covering key CLT areas, and include an error typology breaking down issues like regulatory misstatements, omissions, and coherence problems. This analysis will help quantify the safety improvements for HR compliance. revision: yes
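Cohen's kappa, as promised in the third response, is straightforward to compute for two raters. A minimal sketch with illustrative correct/incorrect verdicts, not the paper's expert data:

```python
# Cohen's kappa for two raters: observed agreement corrected for the
# agreement expected by chance from each rater's label frequencies.
# The verdict labels below are illustrative only.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
b = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

On these toy labels, kappa ≈ 0.67 would land in the band conventionally read as substantial agreement; reporting the value alongside raw percent agreement is what makes the three-expert assessment interpretable.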

Circularity Check

0 steps flagged

No circularity: purely empirical comparison without derivation or fitted inputs

full rationale

The paper reports an empirical evaluation of a multi-agent LLM system (built with CrewAI and RAG) versus a single-LLM baseline on Brazilian labor-law Q&A. Metrics (BLEU, LLM-as-judge, expert ratings) are applied directly to generated outputs; no equations, parameter fitting, predictions derived from the same data, or self-citation chains appear in the derivation. The central claim reduces to experimental results rather than any self-referential construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms are introduced; the work rests on the domain assumption that LLMs can be prompted to act as specialized legal agents.

axioms (1)
  • domain assumption LLMs can be effectively specialized for distinct legal subtasks via prompting and agent roles
    Central to the multi-agent design described in the abstract.

pith-pipeline@v0.9.0 · 5617 in / 1077 out tokens · 43833 ms · 2026-05-15T11:44:48.530163+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 7 internal anchors

  1.–4. [1]–[4] not recoverable (BibTeX style-file text was extracted in place of these entries)

  5. [5]

    Meta AI (2024). Llama 3.1 8B. https://huggingface.co/meta-llama/Llama-3.1-8B

  6. [6]

    Blair-Stanek, A., Holzenberger, N., and Van Durme, B. (2023). Can gpt-3 perform statutory reasoning? In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, 22--31

  7. [7]

    Brasil (1943). Consolidação das leis do trabalho. Decreto-Lei nº 5.452, de 1º de maio de 1943

  8. [8]

    Brown, T.B. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165

  9. [9]

    Cabral, A.A. and Del Mônaco, M. (2015). O direito civil e a sua aplicação ao direito do trabalho: Abordagem histórica e dogmática. PDF file. ://as1.trt3.jus.br/bd-trt3/handle/11103/27278

  10. [10]

    Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1--53

  11. [11]

    Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., and Jégou, H. (2024). The faiss library. arXiv preprint arXiv:2401.08281

  12. [12]

    Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., and Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2

  13. [13]

    Gibaut, W. (2024). Periquito-3b: Modelo de linguagem em português. https://huggingface.co/wandgibaut/periquito-3B. Language model based on LLaMA 2-3B, fine-tuned for Portuguese text comprehension and generation

  14. [14]

    Lin, C.Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74--81

  15. [15]

    Lin, C.Y. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics, 150--157

  16. [16]

    Lin, C.Y. and Och, F.J. (2004). Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 605--612

  17. [17]

    Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., and Gao, J. (2024). Large language models: A survey. arXiv preprint arXiv:2402.06196

  18. [18]

    Recogna NLP (2024). Bode 7b - alpaca pt-br. https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br. Language model based on LLaMA 2-7B, fine-tuned for instruction following in Portuguese

  19. [19]

    OpenAI (2024). GPT-4o mini: Avançando a inteligência de forma econômica. ://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/. Accessed: 2024-11-21

  20. [20]

    Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311--318

  21. [21]

    Sales Almeida, T., Abonizio, H., Nogueira, R., and Pires, R. (2024). Sabiá-2: A new generation of Portuguese large language models. arXiv e-prints, arXiv--2403

  22. [22]

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. (2024). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36

  23. [23]

    Silva, M.d. (2017). A justiça do trabalho: Importância e desafios em 76 anos de história. Tribunal Regional do Trabalho da 3ª Região de Minas Gerais. ://sistemas.trt3.jus.br/bd-trt3/handle/11103/27501

  24. [24]

    Talebirad, Y. and Nadiri, A. (2023). Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314

  25. [25]

    Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing Systems

  26. [26]

    Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024). Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672

  27. [27]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824--24837

  28. [28]

    Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C. (2023). Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155

  29. [29]

    Zhang, Y., Mao, S., Ge, T., Wang, X., de Wynter, A., Xia, Y., Wu, W., Song, T., Lan, M., and Wei, F. (2024). Llm as a mastermind: A survey of strategic reasoning with large language models. arXiv preprint arXiv:2404.01230

  30. [30]

    Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223