pith. machine review for the scientific record.

arxiv: 2604.16337 · v1 · submitted 2026-03-13 · 💻 cs.IR · cs.AI · cs.CY

Recognition: no theorem link

HR-Agents: Using Multiple LLM-based Agents to Improve Q&A about Brazilian Labor Legislation

Authors on Pith no claims yet

Pith reviewed 2026-05-15 11:44 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CY
keywords multi-agent systems · large language models · retrieval-augmented generation · Brazilian labor law · question answering · HR compliance · CrewAI · legal assistance

The pith

A multi-agent LLM system improves response coherence and correctness for Brazilian labor law questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multi-agent system using large language models to answer questions about Brazilian labor legislation, specifically the CLT framework. Specialized agents handle different aspects of employment law and collaborate through CrewAI while incorporating retrieval-augmented generation for better context. This setup is tested against a standard single-LLM RAG approach using metrics like BLEU scores, LLM-as-judge ratings, and some expert reviews. A sympathetic reader would care because it addresses real inefficiencies in HR compliance work, where complex rules often cause delays and errors. If effective, it offers a practical way to make legal Q&A faster and more reliable.

Core claim

The authors establish that deploying multiple LLM-based agents, each focused on distinct tasks such as retrieval, analysis, and validation, within a RAG-enhanced framework leads to improved coherence and correctness in responses to queries about Brazilian labor law when compared to a single LLM baseline.

What carries the argument

A CrewAI-orchestrated multi-agent architecture where specialized LLM agents cooperate on aspects of labor law Q&A, integrated with RAG for contextual relevance.
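The retrieval, analysis, and validation division of labor described above can be sketched without the CrewAI dependency. The roles, instructions, and the `llm`/`retrieve` callables below are illustrative placeholders, not the paper's actual agents or prompts:

```python
# Library-free sketch of a retrieval -> analysis -> validation agent chain.
# Roles, instructions, and the llm/retrieve callables are placeholders,
# not the paper's CrewAI configuration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    role: str
    instruction: str
    llm: Callable[[str], str]  # any text-in/text-out model

    def run(self, context: str) -> str:
        return self.llm(f"[{self.role}] {self.instruction}\n\n{context}")

def answer(question: str,
           retrieve: Callable[[str], str],  # RAG retriever over the CLT corpus
           llm: Callable[[str], str]) -> str:
    """Each agent consumes the previous agent's output, so the validator
    sees both the question and the grounded draft."""
    retriever = Agent("retriever", "Select the CLT passages relevant to the question.", llm)
    analyst = Agent("analyst", "Draft an answer grounded only in the passages.", llm)
    validator = Agent("validator", "Check the draft against the passages and fix errors.", llm)

    passages = retriever.run(f"Question: {question}\nCandidates:\n{retrieve(question)}")
    draft = analyst.run(f"Question: {question}\nPassages:\n{passages}")
    return validator.run(f"Question: {question}\nDraft:\n{draft}")
```

The chaining is the point: the validation agent only adds value over a single-LLM pipeline if it receives enough context (question, passages, draft) to catch grounding errors.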

Load-bearing premise

Automated metrics like BLEU and LLM-as-judge, plus limited expert input, reliably indicate legal accuracy and usefulness for actual HR professionals dealing with live questions.
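This premise is easy to probe for BLEU in particular: the metric rewards n-gram overlap, so a legally accurate paraphrase can score far below a near-verbatim answer. A toy illustration with a simplified sentence-level BLEU (unigram plus bigram, brevity penalty); the sentences are invented, not from the paper's dataset:

```python
# Simplified sentence-level BLEU (up to bigrams) to show that a correct
# paraphrase can score far below a near-verbatim answer. Toy sentences only.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[g]) for g, count in c.items())  # clipped counts
        total = max(sum(c.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero overlaps
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "daily working hours may be extended by at most two overtime hours"
literal = "daily working hours may be extended by at most two extra hours"
paraphrase = "the workday can grow by no more than two additional hours"

print(bleu(literal, ref) > bleu(paraphrase, ref))  # True: near-verbatim wins
```

Both candidates state the same rule, yet the paraphrase shares almost no n-grams with the reference, which is exactly the failure mode the premise has to survive.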

What would settle it

Direct comparison of error rates in responses when HR professionals apply the system versus the baseline in real compliance cases.

Figures

Figures reproduced from arXiv: 2604.16337 by Abriel K. Moraes, Amparo Munoz, Charles S. Oliveira, Erik Soares, Fabiana C. Q. de O. Marucci, Gabriel S. M. Dias, Gabriel U. Talasso, Leonardo R. do Nascimento, Leonardo T. dos Santos, Lucas D. Gessoni, Maria L. A. de S. Cruvinel, Matheus H. R. Vicente, Renata De Paris, Sildolfo Gomes, Vitor G. C. B. de Farias, Vitor L. Fabris, Wandemberg Gibaut.

Figure 3. GPT-4o Answer Similarity
Figure 1. GPT-4o BLEU evaluation
Figure 2. Llama 3.1 8B BLEU evaluation. Adjoining text from the paper: "Figures 3 and 4 display the answer similarity scores, indicating that GPT-4o outperformed Llama within the CrewAI agent configuration, although both models demonstrated satisfactory lexical alignment with the ground truth across both RAG and CrewAI settings. These results indicate that, in the tests conducted, both models were generally capable of producing responses contain…"
Figure 5. GPT-4o Answer Correctness results
Figure 6. Llama 3.1 8B Answer Correctness results. Summary statistics (Precision / Quality): mean 6.47 / 9.75; median 8.00 / 10.00; std 3.43 / 0.78; var 11.79 / 0.61; min 0.00 / 6.00; max 10.00 / 10.00; range 10.00 / 4.00; cv (%) 53.06 / 8.05
Original abstract

The Consolidation of Labor Laws (CLT) serves as the primary legal framework governing labor relations in Brazil, ensuring essential protections for workers. However, its complexity creates challenges for Human Resources (HR) professionals in navigating regulations and ensuring compliance. Traditional methods for addressing labor law inquiries often lead to inefficiencies, delays, and inconsistencies. To enhance the accuracy and efficiency of legal question-answering (Q&A), a multi-agent system powered by Large Language Models (LLMs) is introduced. This approach employs specialized agents to address distinct aspects of employment law while integrating Retrieval-Augmented Generation (RAG) to enhance contextual relevance. Implemented using CrewAI, the system enables cooperative agent interactions, ensuring response validation and reducing misinformation. The effectiveness of this framework is evaluated through a comparison with a baseline RAG pipeline utilizing a single LLM, using automated metrics such as BLEU, LLM-as-judge evaluations, and expert human assessments. Results indicate that the multi-agent approach improves response coherence and correctness, providing a more reliable and efficient solution for HR professionals. This study contributes to AI-driven legal assistance by demonstrating the potential of multi-agent LLM architectures in improving labor law compliance and streamlining HR operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents HR-Agents, a multi-agent LLM system implemented with CrewAI for Q&A on Brazilian labor legislation (CLT). Specialized agents handle distinct aspects of employment law, integrated with RAG for contextual relevance and response validation. It is evaluated against a single-LLM RAG baseline using BLEU, LLM-as-judge, and expert human assessments, claiming improvements in coherence and correctness for HR professionals.

Significance. If the evaluation metrics reliably indicate reduced risk of incorrect legal advice, the multi-agent framework could significantly aid HR professionals in complying with complex Brazilian labor laws, offering efficiency gains over traditional methods. The work demonstrates the potential of cooperative LLM agents in legal domains, though the practical impact hinges on the validity of the chosen evaluation methods for high-stakes accuracy.

major comments (3)
  1. [Abstract] The abstract reports metric improvements using BLEU, LLM-as-judge evaluations, and expert assessments but provides no details on dataset size, question difficulty distribution, statistical significance, or potential confounds in the human evaluation. This omission is load-bearing because it prevents assessment of whether the claimed gains in coherence and correctness translate to lower compliance risk for HR professionals.
  2. [Evaluation] BLEU is employed as a primary automated metric, yet it penalizes valid paraphrases of legal text while rewarding superficial n-gram overlap; this choice undermines the claim of improved correctness for CLT questions (e.g., overtime rules or threshold values) without additional validation that BLEU tracks legal accuracy.
  3. [Evaluation] LLM-as-judge ratings and limited expert assessments lack reported inter-rater reliability, question sampling frame, or error typology (e.g., misstated regulations). Without these, the results do not establish that the multi-agent outputs are materially safer than the single-LLM RAG baseline for live HR compliance queries.
minor comments (2)
  1. [Abstract] The abstract could more explicitly state the number and specialized roles of the agents to improve immediate clarity for readers.
  2. Consider including a summary table of all quantitative metric values (BLEU, LLM-judge scores, human ratings) with confidence intervals to facilitate direct comparison with the baseline.
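The second minor comment's request can be sketched as a mean paired difference (multi-agent score minus baseline score per question) with a normal-approximation 95% interval. The scores below are invented for illustration, and at small n a t-interval would be more appropriate than the z multiplier used here:

```python
# Sketch of a paired comparison with a confidence interval.
# Per-question scores are made up; not the paper's data.
import statistics as st

multi_agent = [8.0, 9.5, 7.0, 9.0, 8.5, 9.0, 7.5, 8.0]
baseline =    [7.5, 8.0, 7.0, 8.5, 7.0, 8.5, 6.0, 7.5]

diffs = [m - b for m, b in zip(multi_agent, baseline)]
mean = st.mean(diffs)
sem = st.stdev(diffs) / len(diffs) ** 0.5        # standard error of the mean
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem    # z approx; small n warrants t
print(f"mean paired difference {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Pairing by question is the design choice that matters: it removes per-question difficulty variance that would otherwise swamp a between-system comparison on only a few dozen queries.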

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We provide point-by-point responses to the major comments and indicate the revisions we will implement in the next version of the paper.

Point-by-point responses
  1. Referee: [Abstract] The abstract reports metric improvements using BLEU, LLM-as-judge evaluations, and expert assessments but provides no details on dataset size, question difficulty distribution, statistical significance, or potential confounds in the human evaluation. This omission is load-bearing because it prevents assessment of whether the claimed gains in coherence and correctness translate to lower compliance risk for HR professionals.

    Authors: We agree that the abstract should provide more context on the evaluation setup. In the revised version, we will include details on the dataset comprising 50 questions with a balanced distribution across difficulty levels and CLT topics, report statistical significance tests (e.g., paired t-tests with p-values), and outline the human evaluation process involving three HR experts to address potential confounds such as evaluator bias. revision: yes

  2. Referee: [Evaluation] BLEU is employed as a primary automated metric, yet it penalizes valid paraphrases of legal text while rewarding superficial n-gram overlap; this choice undermines the claim of improved correctness for CLT questions (e.g., overtime rules or threshold values) without additional validation that BLEU tracks legal accuracy.

    Authors: We recognize the shortcomings of BLEU in evaluating legal Q&A where paraphrasing is common and accuracy is semantic rather than lexical. Although BLEU served as a supplementary automated metric, our conclusions primarily draw from LLM-as-judge and expert assessments. We will revise the evaluation section to explicitly discuss BLEU's limitations in this domain and incorporate additional metrics like BERTScore for better alignment with legal correctness. Examples of responses will be added to illustrate cases where lower BLEU scores still correspond to accurate legal advice. revision: partial

  3. Referee: [Evaluation] LLM-as-judge ratings and limited expert assessments lack reported inter-rater reliability, question sampling frame, or error typology (e.g., misstated regulations). Without these, the results do not establish that the multi-agent outputs are materially safer than the single-LLM RAG baseline for live HR compliance queries.

    Authors: We agree that these details are essential for robust claims. The revised manuscript will report inter-rater reliability using Cohen's kappa for the expert assessments, specify the question sampling as a stratified random sample of 50 queries from a larger pool covering key CLT areas, and include an error typology breaking down issues like regulatory misstatements, omissions, and coherence problems. This analysis will help quantify the safety improvements for HR compliance. revision: yes
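Cohen's kappa, as promised in the third response, is straightforward to compute for two raters. A minimal sketch with illustrative correct/incorrect verdicts, not the paper's expert data:

```python
# Cohen's kappa for two raters: observed agreement corrected for the
# agreement expected by chance from each rater's label frequencies.
# The verdict labels below are illustrative only.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
b = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

On these toy labels, kappa ≈ 0.67 would land in the band conventionally read as substantial agreement; reporting the value alongside raw percent agreement is what makes the three-expert assessment interpretable.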

Circularity Check

0 steps flagged

No circularity: purely empirical comparison without derivation or fitted inputs

full rationale

The paper reports an empirical evaluation of a multi-agent LLM system (built with CrewAI and RAG) versus a single-LLM baseline on Brazilian labor-law Q&A. Metrics (BLEU, LLM-as-judge, expert ratings) are applied directly to generated outputs; no equations, parameter fitting, predictions derived from the same data, or self-citation chains appear in the derivation. The central claim reduces to experimental results rather than any self-referential construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or non-standard axioms are introduced; the work rests on the domain assumption that LLMs can be prompted to act as specialized legal agents.

axioms (1)
  • domain assumption LLMs can be effectively specialized for distinct legal subtasks via prompting and agent roles
    Central to the multi-agent design described in the abstract.

pith-pipeline@v0.9.0 · 5617 in / 1077 out tokens · 43833 ms · 2026-05-15T11:44:48.530163+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 7 internal anchors

  1.–4. [1]–[4] not recoverable (BibTeX style-file text was extracted in place of these entries)

  5. [5]

    Meta AI (2024). Llama 3.1 8B. https://huggingface.co/meta-llama/Llama-3.1-8B

  6. [6]

    Blair-Stanek, A., Holzenberger, N., and Van Durme, B. (2023). Can gpt-3 perform statutory reasoning? In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, 22--31

  7. [7]

    Brasil (1943). Consolidação das leis do trabalho. Decreto-Lei nº 5.452, de 1º de maio de 1943

  8. [8]

    Brown, T.B. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165

  9. [9]

    Cabral, A.A. and Del Mônaco, M. (2015). O direito civil e a sua aplicação ao direito do trabalho: Abordagem histórica e dogmática. PDF file. ://as1.trt3.jus.br/bd-trt3/handle/11103/27278

  10. [10]

    Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70), 1--53

  11. [11]

    Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L., and Jégou, H. (2024). The faiss library. arXiv preprint arXiv:2401.08281

  12. [12]

    Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., and Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2

  13. [13]

    Gibaut, W. (2024). Periquito-3b: Modelo de linguagem em português. https://huggingface.co/wandgibaut/periquito-3B. Language model based on LLaMA 2-3B, fine-tuned for Portuguese text comprehension and generation

  14. [14]

    Lin, C.Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74--81

  15. [15]

    Lin, C.Y. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics, 150--157

  16. [16]

    Lin, C.Y. and Och, F.J. (2004). Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 605--612

  17. [17]

    Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., and Gao, J. (2024). Large language models: A survey. arXiv preprint arXiv:2402.06196

  18. [18]

    Recogna NLP (2024). Bode 7b - alpaca pt-br. https://huggingface.co/recogna-nlp/bode-7b-alpaca-pt-br. Language model based on LLaMA 2-7B, fine-tuned for instruction following in Portuguese

  19. [19]

    OpenAI (2024). GPT-4o mini: Avançando a inteligência de forma econômica. ://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/. Accessed: 2024-11-21

  20. [20]

    Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311--318

  21. [21]

    Sales Almeida, T., Abonizio, H., Nogueira, R., and Pires, R. (2024). Sabiá-2: A new generation of Portuguese large language models. arXiv e-prints, arXiv--2403

  22. [22]

    Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. (2024). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36

  23. [23]

    Silva, M.d. (2017). A justiça do trabalho: Importância e desafios em 76 anos de história. Tribunal Regional do Trabalho da 3ª Região de Minas Gerais. ://sistemas.trt3.jus.br/bd-trt3/handle/11103/27501

  24. [24]

    Talebirad, Y. and Nadiri, A. (2023). Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314

  25. [25]

    Vaswani, A. (2017). Attention is all you need. Advances in Neural Information Processing Systems

  26. [26]

    Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024). Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672

  27. [27]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824--24837

  28. [28]

    Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C. (2023). Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155

  29. [29]

    Zhang, Y., Mao, S., Ge, T., Wang, X., de Wynter, A., Xia, Y., Wu, W., Song, T., Lan, M., and Wei, F. (2024). Llm as a mastermind: A survey of strategic reasoning with large language models. arXiv preprint arXiv:2404.01230

  30. [30]

    Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223