Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation

Chuanwei Zhou; Chunyan Xu; Runlong Cao; Tianrun Chen; Tong Zhang; Ying Zang; Zhen Cui

arxiv: 2605.28239 · v1 · pith:C7GTWF2Vnew · submitted 2026-05-27 · 💻 cs.CV

Runlong Cao, Ying Zang, Chuanwei Zhou, Tianrun Chen, Tong Zhang, Zhen Cui, Chunyan Xu

Pith reviewed 2026-06-29 12:44 UTC · model grok-4.3

keywords semi-supervised segmentation · referring expression segmentation · pseudo-labeling · reinforcement learning · multimodal priors · self-evolving framework · pixel-level language grounding

The pith

Casting pseudo-label selection as reinforcement learning creates a self-evolving loop that jointly improves the model and its supervision signals for referring expression segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the problem of unreliable pseudo-labels in semi-supervised referring expression segmentation when only limited image-text pairs are annotated. It introduces a framework that extracts semantic-spatial priors from a multimodal large language model to create initial soft proposals and guidance signals for a hierarchical segmentation network. Pseudo-label construction is reframed as an exploratory decision process where reinforcement learning rewards selections that provide high-utility pixel supervision based on both priors and current model predictions. This forms a closed loop in which the segmentation model and the pseudo-labels are optimized together, increasing label reliability over time. Experiments on standard benchmarks show gains over prior semi-supervised methods.

Core claim

The central claim is that reinforced pseudo-label selection, formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions, enables a self-evolving loop for joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision.

What carries the argument

Reinforced pseudo-label selection as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions.
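
To make the idea of an exploratory decision process concrete, here is a minimal sketch of a policy-gradient (REINFORCE) update over a single selection threshold. It is not the paper's formulation: the Gaussian policy, the `toy_utility` reward, and every name and hyperparameter below are assumptions chosen only to show how sampling, rewarding, and updating fit together.

```python
import random


def toy_utility(tau: float) -> float:
    """Hypothetical reward that peaks at tau = 0.6 (illustration only).

    In a real system the reward would score the supervision produced by
    the threshold; this stand-in just gives the policy something to climb.
    """
    return -(tau - 0.6) ** 2


def train_threshold_policy(steps: int = 500, batch: int = 16,
                           lr: float = 0.05, sigma: float = 0.1,
                           seed: int = 0) -> float:
    """Learn the mean of a Gaussian policy over a threshold with REINFORCE."""
    rng = random.Random(seed)
    mu = 0.3  # assumed initial policy mean
    for _ in range(steps):
        # Explore: sample candidate thresholds from the current policy.
        taus = [rng.gauss(mu, sigma) for _ in range(batch)]
        rewards = [toy_utility(t) for t in taus]
        # Batch-mean baseline to reduce gradient variance.
        baseline = sum(rewards) / batch
        # d/dmu log N(tau | mu, sigma) = (tau - mu) / sigma^2
        grad = sum((r - baseline) * (t - mu) / sigma ** 2
                   for r, t in zip(rewards, taus)) / batch
        mu += lr * grad
    return mu


if __name__ == "__main__":
    print(f"learned threshold mean: {train_threshold_policy():.3f}")
```

The baseline subtraction is the usual variance-reduction step; whether the actual method uses a learned critic or another estimator is not something this sketch claims.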

If this is right

Joint optimization of model and labels produces progressively more reliable supervision signals.
The framework yields measurable accuracy gains over existing semi-supervised methods on RefCOCO, RefCOCO+, and RefCOCOg.
Multimodal priors from an MLLM can be elevated into learnable guidance that conditions the segmentation network.
The approach maintains generalization when supervision is limited to sparse image-text pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reinforced selection loop could be tested on other pixel-level grounding tasks such as referring video segmentation.
Replacing the MLLM prior extractor with a lighter vision-language model might reduce compute while preserving most gains.
Monitoring the variance of reward signals across iterations could serve as an early indicator of training stability.
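
The reward-variance idea in the last item could be prototyped roughly as follows. This is a sketch under assumptions: the class name `RewardVarianceMonitor`, the window size, and the alarm ratio are all hypothetical, and the first full window is simply taken as the reference level.

```python
from collections import deque
from statistics import pvariance


class RewardVarianceMonitor:
    """Flags when sliding-window reward variance exceeds a reference level."""

    def __init__(self, window: int = 50, ratio: float = 4.0):
        self.window = deque(maxlen=window)  # most recent rewards
        self.ratio = ratio                  # assumed alarm multiplier
        self.reference = None               # variance of the first full window

    def update(self, reward: float) -> bool:
        """Record a reward; return True if variance looks unstable."""
        self.window.append(reward)
        if len(self.window) < self.window.maxlen:
            return False  # not enough history yet
        var = pvariance(self.window)
        if self.reference is None:
            self.reference = var  # first full window sets the reference
            return False
        return var > self.ratio * max(self.reference, 1e-12)


if __name__ == "__main__":
    monitor = RewardVarianceMonitor(window=50, ratio=4.0)
    # Calm phase: rewards hover tightly around 0.5.
    for i in range(50):
        monitor.update(0.5 + (0.01 if i % 2 else -0.01))
    # Noisy phase: rewards swing between 0 and 1.
    print(any(monitor.update(float(i % 2)) for i in range(10)))
```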

Load-bearing premise

The reinforcement mechanism for choosing pseudo-labels will produce stable improvements rather than unstable or collapsing learning dynamics.

What would settle it

Train the framework for multiple iterations on RefCOCO and measure whether the average IoU of selected pseudo-labels against a held-out ground-truth subset rises steadily or eventually declines.
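
A small sketch of how that check might be scripted, assuming binary masks stored as nested lists and one mean-IoU value logged per training iteration. The helper names and the `patience` rule for calling a decline are hypothetical, not taken from the paper.

```python
def mask_iou(pred, gt) -> float:
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def mean_iou(pseudo_labels, ground_truths) -> float:
    """Average IoU of selected pseudo-labels over a held-out subset."""
    scores = [mask_iou(p, g) for p, g in zip(pseudo_labels, ground_truths)]
    return sum(scores) / len(scores)


def first_sustained_decline(history, patience: int = 2):
    """Index where IoU has fallen for `patience` consecutive steps, else None."""
    drops = 0
    for i in range(1, len(history)):
        drops = drops + 1 if history[i] < history[i - 1] else 0
        if drops >= patience:
            return i
    return None


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [1, 0]]
    print(mask_iou(pred, gt))
    print(first_sustained_decline([0.50, 0.60, 0.65, 0.63, 0.60, 0.58]))
```

A history that only rises returns `None`, which is the outcome the claim of progressive improvement would predict.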

Figures

Figures reproduced from arXiv: 2605.28239 by Chuanwei Zhou, Chunyan Xu, Runlong Cao, Tianrun Chen, Tong Zhang, Ying Zang, Zhen Cui.

**Figure 2.** Performance comparisons on RefCOCO, RefCOCO+, and RefCOCOg datasets at different label rates.

**Figure 3.** Overview of the proposed Learning to Label (L2L) framework. For each unlabeled image–ref pair, a frozen MLLM predicts referring grounding cues and prompts SAM2 to generate a soft segmentation prior, which is then uncertainty-adaptively calibrated by SPM using the model prediction under weak augmentation to obtain $\tilde{P}^{\dagger}$. The calibrated prior provides structured conditional guidance to the segmentor via SESM.…

**Figure 4.** Qualitative analysis of Ground Truth, L2L, RESMatch, Baseline (w/o MLLM), and Baseline (w/ MLLM) on RefCOCO under the 5% semi-supervised setting. The white numbers indicate the IoU with the ground-truth mask. Typical failure regions are highlighted with red dashed boxes, and incorrect results are marked in red; correct segmentations are marked in green.

**Figure 5.** Sensitivity to a fixed foreground threshold. Overall IoU on RefCOCO/RefCOCO+/RefCOCOg under the 5% labeled setting when replacing RPLE with a static threshold $\tau_{fg}$. RPLE results are shown as dashed horizontal lines, achieving comparable or better performance without manual threshold tuning. The vertical dotted line marks the default fixed threshold $\tau_{fg} = 0.7$.

**Figure 6.** Qualitative visualization of strong augmentations. We show representative examples of strong image augmentations and strong text augmentations applied to the same unlabeled sample.

**Figure 7.** Labeled and unlabeled data statistics under four label budgets on RefCOCO, RefCOCO+, and RefCOCOg. Following the semi-supervised RES protocol in (Sun et al., 2023; Yang et al., 2024; Zang et al., 2025), we study the in-distribution setting where the official training split is partitioned into a labeled subset and an unlabeled subset drawn from the same data distribution. We consider four label budgets, name…

**Figure 8.** Confidence map visualization on RefCOCO under the 5% semi-supervised setting. From left to right: referring expression (Ref), input image (Image), ground-truth mask (Ground Truth), full model (Ours), ablation without SESM (w/o SESM), ablation without RPLE (w/o RPLE), and the baseline. Confidence is rendered in grayscale, where brighter pixels indicate higher confidence. Compared with the baseline and ablat…

**Figure 9.** Temporal evolution of pseudo supervision quality over the 40-epoch training process. The MLLM-guided prior $P_i^{\dagger}$ remains fixed, while the calibrated fused prior $\tilde{P}_i^{\dagger}$ and the selected-region accuracy progressively improve during training.

**Figure 10.** Qualitative analysis of L2L on RefCOCO under the 5% semi-supervised setting. The white numbers indicate the IoU with the ground-truth mask. Typical failure regions are highlighted with red dashed boxes, and incorrect results are marked in red.

**Figure 11.** More qualitative analysis of Ground Truth, L2L, RESMatch, Baseline (w/o MLLM), and Baseline (w/ MLLM) on RefCOCO under the 5% semi-supervised setting. The white numbers indicate the IoU with the ground-truth mask. Typical failure regions are highlighted with red dashed boxes, and incorrect results are marked in red; correct segmentations are marked in green.

**Figure 12.** More qualitative analysis of Ground Truth, L2L, RESMatch, Baseline (w/o MLLM), and Baseline (w/ MLLM) on RefCOCO+ under the 5% semi-supervised setting. The white numbers indicate the IoU with the ground-truth mask. Typical failure regions are highlighted with red dashed boxes, and incorrect results are marked in red; correct segmentations are marked in green.

**Figure 13.** More qualitative analysis of Ground Truth, L2L, RESMatch, Baseline (w/o MLLM), and Baseline (w/ MLLM) on RefCOCOg under the 5% semi-supervised setting. The white numbers indicate the IoU with the ground-truth mask. Typical failure regions are highlighted with red dashed boxes, and incorrect results are marked in red; correct segmentations are marked in green.

Original abstract

Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, reinforced pseudo-label selection is formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating its effectiveness and generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main move is casting pseudo-label selection as an RL decision process on top of MLLM priors for semi-supervised referring expression segmentation, but the abstract supplies no equations or stability details for that loop.

The core idea is to treat pseudo-label construction as a learnable exploratory process rather than a static heuristic. It starts with multimodal LLM priors turned into soft proposals, feeds those plus text into a hierarchical segmenter, and then uses reinforcement to pick which pixel-level signals are worth keeping in each round. The self-evolving loop is meant to jointly improve the model and the labels under sparse supervision. That specific combination for referring expression segmentation is the part that looks new.

The experiments claim gains on RefCOCO, RefCOCO+, and RefCOCOg, which at least shows the authors ran the usual benchmarks and saw movement. For readers who already work on semi-supervised vision-language grounding, this could be a useful data point on whether RL selection helps more than standard pseudo-labeling tricks.

The weak part is exactly what the stress-test note flags: the abstract gives no state or action definitions, no reward function, and no mention of how they avoid confirmation bias or reward sparsity when the action space is per-pixel. In high-dimensional segmentation, an RL loop can easily lock in early errors instead of correcting them. Without those mechanics shown, the central claim that the reinforced selection produces stable progressive improvement rests on unverified assumptions. The circularity burden looks low because the method brings in external MLLM signals, but that does not fix the missing implementation details.

This is for people already inside semi-supervised RES or related label-refinement work who need concrete baselines. A reader outside that niche will not get much. It is worth sending to a serious referee because the problem is real and the framing is coherent on its own terms, even though the current write-up leaves the RL component under-specified and the results hard to evaluate without the full experimental section.

Referee Report

2 major / 2 minor

Summary. The paper proposes Learning to Label (L2L), a reinforced self-evolving framework for semi-supervised referring expression segmentation (SS-RES). It uses a multimodal large language model to extract semantic-spatial priors instantiated as initial soft segmentation proposals, which together with textual cues condition a hierarchical segmentation network. Pseudo-label construction is cast as a learnable decision-making process via reinforced pseudo-label selection formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This creates a self-evolving loop for joint optimization of the segmentation model and pseudo-labels to improve label reliability under sparse supervision. Experiments on RefCOCO, RefCOCO+, and RefCOCOg report improvements over existing methods.

Significance. If the reinforced selection mechanism can be shown to produce stable joint optimization without amplifying early errors, the framework would offer a novel way to address unreliable pseudo-labels in SS-RES by integrating MLLM priors with RL-based selection. This could advance semi-supervised pixel-level language grounding methods. The current presentation, however, provides no equations, state/action definitions, or stability analysis, preventing assessment of whether the central claim holds.

major comments (2)

[Reinforced pseudo-label selection description] The reinforced pseudo-label selection is described as an 'exploratory decision process' that 'adaptively rewards high-utility pixel-level supervision,' but no equations, state space, action space, reward function, or policy update rules are provided. This formulation is load-bearing for the self-evolving loop claim, yet the combinatorial size of any per-pixel or region action space in RES is unaddressed, leaving open whether stability against confirmation bias or reward sparsity is achieved.
[Experiments and results] The abstract and experimental claims state that the loop is 'progressively enhancing label reliability' and yields improvements on the RefCOCO datasets, but no ablation studies isolate the contribution of the RL selection, no error analysis tracks pseudo-label quality over iterations, and nothing verifies that the gains exceed those from MLLM priors alone. This undermines evaluation of the joint optimization claim.

minor comments (2)

[Framework overview] The term 'hierarchical segmentation network' is introduced without architectural details or reference to its base model, making it difficult to reproduce the conditioning mechanism.
[Framework overview] Notation for 'soft segmentation proposals' and 'learnable guidance signals' is introduced without formal definitions or equations, reducing clarity of how MLLM outputs are integrated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to provide the requested details and experiments.

Referee: [Reinforced pseudo-label selection description] The reinforced pseudo-label selection is described as an 'exploratory decision process' that 'adaptively rewards high-utility pixel-level supervision,' but no equations, state space, action space, reward function, or policy update rules are provided. This formulation is load-bearing for the self-evolving loop claim, yet the combinatorial size of any per-pixel or region action space in RES is unaddressed, leaving open whether stability against confirmation bias or reward sparsity is achieved.

Authors: We agree that the current manuscript lacks sufficient mathematical detail on the reinforced pseudo-label selection. In the revised version we will add a dedicated subsection with explicit definitions of the state space (multimodal prior features concatenated with model predictions), region-level action space (to control combinatorial size), reward function (utility combining prior consistency and prediction confidence), and policy gradient update rules, along with analysis of how the exploratory process addresses confirmation bias and sparsity. revision: yes

Referee: [Experiments and results] The abstract and experimental claims state that the loop is 'progressively enhancing label reliability' and yields improvements on the RefCOCO datasets, but no ablation studies isolate the contribution of the RL selection, no error analysis tracks pseudo-label quality over iterations, and nothing verifies that the gains exceed those from MLLM priors alone. This undermines evaluation of the joint optimization claim.

Authors: We agree additional experiments are required. The revision will include ablations that disable the RL selection (replacing it with fixed thresholding), plots of pseudo-label IoU against ground-truth over iterations on a held-out subset, and a direct baseline using only the initial MLLM priors without the self-evolving loop, to isolate the contribution of the reinforced selection to the reported gains. revision: yes
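
For readers wondering what the fixed-thresholding ablation might look like, the following is a rough sketch. The thresholds 0.70 and 0.20 match the RPLE initialization quoted in the implementation fragment on this page; everything else is an assumption, including the `IGNORE` value, the function names, and the way the boundary band is built (the source mentions a 3-pixel-wide ignored band but not its exact construction).

```python
IGNORE = 255  # assumed label for pixels excluded from the loss


def threshold_labels(prob, tau_fg: float = 0.70, tau_bg: float = 0.20):
    """Map a soft foreground map to foreground (1), background (0), or IGNORE."""
    return [[1 if p >= tau_fg else 0 if p <= tau_bg else IGNORE for p in row]
            for row in prob]


def ignore_boundary_band(labels, radius: int = 1):
    """Mark pixels within `radius` of an opposite-class pixel as IGNORE.

    This is one simple way to suppress ambiguous boundary supervision;
    it is a hypothetical construction, not the paper's exact procedure.
    """
    height, width = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for y in range(height):
        for x in range(width):
            here = labels[y][x]
            if here == IGNORE:
                continue
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] not in (IGNORE, here)):
                        out[y][x] = IGNORE
    return out


if __name__ == "__main__":
    soft = [[0.9, 0.9, 0.1, 0.1]]
    hard = threshold_labels(soft)
    print(hard)
    print(ignore_boundary_band(hard, radius=1))
```

Swapping the two fixed thresholds for values chosen by a learned policy is the comparison the rebuttal promises; this sketch covers only the static side of that ablation.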

Circularity Check

0 steps flagged

No significant circularity in the L2L framework derivation

The paper's central claim is a reinforced self-evolving loop for joint optimization of segmentation model and pseudo-labels in SS-RES. This relies on external MLLM priors for initial proposals, standard RL formulation for pseudo-label selection as an exploratory decision process, and evaluation on standard benchmarks (RefCOCO etc.). No equations or steps reduce the target outcome to fitted parameters defined by the outcome itself, nor invoke self-citation chains for uniqueness or ansatzes. The derivation is self-contained with independent content from multimodal models and RL, warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract, the central claim rests on the effectiveness of the reinforced selection loop and the reliability of MLLM priors; no explicit free parameters, axioms, or invented entities are detailed.

axioms (1)

domain assumption: Multimodal large language models can extract usable semantic-spatial priors from image-text pairs for initial soft segmentation proposals.
Invoked to build foundational understanding and guidance signals for the segmentation network.

invented entities (1)

Reinforced pseudo-label selection as exploratory decision process (no independent evidence)
purpose: To adaptively select high-utility pseudo-labels and enable the self-evolving loop
Formulated to ensure stable learning under sparse supervision; no independent evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5725 in / 1244 out tokens · 45814 ms · 2026-06-29T12:44:49.780229+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 7 canonical work pages · 3 internal anchors

[1]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., and Lin, J. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
[2]

SAM 3: Segment Anything with Concepts

Carion, N., Gustafson, L., Hu, Y.-T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K. V., Khedr, H., Huang, A., et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719.
[3]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[4]

Lgd: Leveraging generative descriptions for zero-shot referring image segmentation

Li, J., Xie, Q., Gu, R., Xu, J., Liu, Y., and Yu, X. Lgd: Leveraging generative descriptions for zero-shot referring image segmentation. arXiv preprint arXiv:2504.14467.
[5]

Ref-diff: Zero-shot referring image segmentation with generative models

Ni, M., Zhang, Y., Feng, K., Li, X., Guo, Y., and Zuo, W. Ref-diff: Zero-shot referring image segmentation with generative models. arXiv preprint arXiv:2308.16777.
[6]

Text augmented spatial-aware zero-shot referring image segmentation

Suo, Y., Zhu, L., and Yang, Y. Text augmented spatial-aware zero-shot referring image segmentation. arXiv preprint arXiv:2310.18049.
[7]

Llafs++: Few-shot image segmentation with large language models

Zhu, L., Chen, T., Ji, D., Xu, P., Ye, J., and Liu, J. Llafs++: Few-shot image segmentation with large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025a.
Zhu, L., Chen, T., Xu, Q., Liu, X., Ji, D., Wu, H., Soh, D. W., and Liu, J. Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation. ...
[8]

Each instance is an image–expression pair with a binary mask as supervision

is a widely used RES benchmark built on MS-COCO images (Lin et al., 2014). Each instance is an image–expression pair with a binary mask as supervision. The benchmark contains 19,994 images, 50,000 annotated objects, and 142,209 referring expressions. We follow the official split with train, val, testA, and testB. The testA split is dominated by person instances...
[9]

The dataset includes 26,711 images, 54,822 objects, and 104,560 expressions

is also known as G-Ref and provides longer descriptions with richer compositional semantics and relational structures. The dataset includes 26,711 images, 54,822 objects, and 104,560 expressions. Following common practice, we adopt the UMD split and report results on val-u and test-u, with 23,199 training samples, 2,601 validation samples, and 4,010 test s...
[10]

When constructing the foreground/background/ignored regions, we suppress ambiguous boundary supervision by marking a 3-pixel-wide band around region boundaries as ignored

Reinforced Pseudo-Label Exploration (RPLE). RPLE is initialized with $\tau_{fg} = 0.70$ and $\tau_{bg} = 0.20$. When constructing the foreground/background/ignored regions, we suppress ambiguous boundary supervision by marking a 3-pixel-wide band around region boundaries as ignored. Both the actor and critic are implemented as two-layer MLPs with hidden size 64 and train...
