Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

Sohyun An (1, 2), Shuibenyang Yuan (1), Hayeon Lee (1), Cho-Jui Hsieh (2), Alexander Min (1) ((1) Meta Superintelligence Labs, (2) UCLA)

arxiv: 2604.12967 · v1 · submitted 2026-04-14 · 💻 cs.AI

Pith reviewed 2026-05-10 15:14 UTC · model grok-4.3

classification 💻 cs.AI

keywords cycle-consistent search, search agent training, reinforcement learning, proxy reward, question reconstruction, information bottleneck, unsupervised training

The pith

Search agents can train without gold answers by using the reconstructability of the original question from their trajectories as a reward signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cycle-Consistent Search, a reinforcement learning framework that trains search agents without any ground-truth answers by rewarding trajectories that allow accurate reconstruction of the input question. The central idea is that a high-quality search path preserves the question's intent as a lossless encoding, so reconstruction success serves as a scalable proxy for policy optimization. Information bottlenecks, such as dropping the final response and masking named entities in queries, are added to block superficial shortcuts and force the reward to depend on the actual retrieved content and search structure. On standard question-answering benchmarks the resulting agents match the performance of supervised methods while beating earlier unsupervised baselines.

Core claim

Cycle-Consistent Search treats question reconstruction accuracy as the reward for a search trajectory, with the hypothesis that only an optimal trajectory encodes the question's full intent. The method applies cycle-consistency by feeding the trajectory back into a reconstruction model and optimizes the search policy via reinforcement learning. Information bottlenecks are imposed by excluding the final answer and applying named-entity masking to queries so that reconstruction must rely on the observations and the structural path rather than lexical leakage. This produces a gold-supervision-free training signal that yields benchmark results comparable to supervised baselines.
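The scoring step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the token-overlap F1 used as the similarity measure, the zero reward for an "N/A" reconstruction, and the names `token_f1`, `cycle_reward`, and `toy_reconstructor` are all hypothetical choices made here. In practice the `reconstruct` callable would wrap a language model.

```python
import re
from collections import Counter
from typing import Callable, List


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between the original and the reconstructed question."""
    ref, cand = Counter(tokens(reference)), Counter(tokens(candidate))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cycle_reward(question: str,
                 observations: List[str],
                 reconstruct: Callable[[List[str]], str]) -> float:
    """Score a trajectory by how well the question is recovered from it."""
    guess = reconstruct(observations)
    if guess.strip() == "N/A":  # assumption: a declined reconstruction earns nothing
        return 0.0
    return token_f1(question, guess)


def toy_reconstructor(observations: List[str]) -> str:
    """Hypothetical stand-in for a model-based reconstructor."""
    if not observations:
        return "N/A"
    return "who directed the film inception"


if __name__ == "__main__":
    obs = ["Inception is a 2010 film directed by Christopher Nolan."]
    print(cycle_reward("Who directed the film Inception?", obs, toy_reconstructor))
    print(cycle_reward("Who directed the film Inception?", [], toy_reconstructor))
```

A lexical metric is used only to keep the sketch dependency-free; an embedding-based or model-judged similarity would slot into `cycle_reward` the same way.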

What carries the argument

The cycle-consistency loop that scores a search trajectory by how well a reconstruction model can recover the original question from the sequence of observations, subject to information bottlenecks that exclude the final response and mask named entities.
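The two bottlenecks can be illustrated with a small sketch that renders the view of a trajectory a reconstruction model would receive. Everything here is an assumption for illustration: the `Step` type, the `apply_bottleneck` helper, and above all the regex, which is a crude stand-in for a real NER model (it masks any capitalized word run or multi-digit number, including sentence-initial words). The `[TAG]` placeholder and the Action/Observation labels follow the wording of the prompt excerpts quoted on this page.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    query: str        # search query issued by the agent (Action)
    observation: str  # text returned by the retriever (Observation)


# Crude stand-in for NER: runs of capitalized words, or numbers with 2+ digits.
_ENTITY = re.compile(r"\b(?:[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*|\d{2,})\b")


def mask_entities(query: str) -> str:
    """Replace entity-like spans in a query with a [TAG] placeholder."""
    return _ENTITY.sub("[TAG]", query)


def apply_bottleneck(steps: List[Step], final_response: str) -> str:
    """Render the bottlenecked trajectory.

    The final response is accepted but intentionally never emitted, and
    only queries are masked; observations pass through unchanged.
    """
    lines = []
    for i, step in enumerate(steps, 1):
        lines.append(f"Action {i}: {mask_entities(step.query)}")
        lines.append(f"Observation {i}: {step.observation}")
    return "\n".join(lines)


if __name__ == "__main__":
    trajectory = [
        Step("director of Inception 2010",
             "Inception is a 2010 film directed by Christopher Nolan."),
    ]
    print(apply_bottleneck(trajectory, final_response="Christopher Nolan"))
```

Because observations are left intact, any entity the reconstructor needs must come from retrieved text rather than from the agent's own query wording, which is the dependence the bottleneck is meant to enforce.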

If this is right

Search agents become trainable in any domain where gold answers are unavailable or expensive to obtain.
The proxy reward directly measures whether a trajectory has captured the information needed to recover the original question.
Performance on question-answering tasks reaches levels comparable to methods that use full supervised labels.
The approach outperforms earlier gold-free methods that lack this cycle-consistency signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reconstruction-as-reward pattern could be tested on multi-hop reasoning agents by reconstructing the initial problem statement from the full reasoning trace.
Stronger reconstruction models might increase the sensitivity of the reward, allowing finer distinctions between near-optimal and suboptimal trajectories.
The framework could be combined with other self-supervised signals such as consistency across paraphrased questions to further stabilize training.

Load-bearing premise

An optimal search trajectory encodes the question's intent in a way that permits accurate reconstruction only when the search itself is informationally complete, and the bottlenecks prevent reconstruction from succeeding via superficial cues instead.

What would settle it

Train two agents on the same questions, one with the full reconstruction reward and one with the bottlenecks removed, then measure whether the version without bottlenecks achieves high reconstruction scores while producing visibly poorer retrieval quality on held-out queries.
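The full experiment requires training two agents, but the shortcut it probes for can be shown with a toy. The sketch below scores a degenerate reconstructor that ignores observations and simply echoes the first query, once with raw queries and once with masked ones. The token-F1 metric, the single-word regex masker, and the names `copy_query_shortcut` and `leakage_gap` are illustrative assumptions, and the example question is invented for the demonstration.

```python
import re
from collections import Counter
from typing import List, Tuple


def token_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 on lowercase alphanumeric tokens."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mask_entities(query: str) -> str:
    """Crude NER stand-in: mask every capitalized word."""
    return re.sub(r"\b[A-Z][\w'-]*\b", "[TAG]", query)


def copy_query_shortcut(queries: List[str]) -> str:
    """Degenerate reconstructor: echoes the first query, ignores observations."""
    return queries[0] if queries else "N/A"


def leakage_gap(question: str, queries: List[str]) -> Tuple[float, float]:
    """Return (score without masking, score with masking) for the shortcut."""
    unmasked = token_f1(question, copy_query_shortcut(queries))
    masked = token_f1(
        question, copy_query_shortcut([mask_entities(q) for q in queries])
    )
    return unmasked, masked


if __name__ == "__main__":
    question = "which city was Christopher Nolan born in"
    queries = ["Christopher Nolan born city"]
    print(leakage_gap(question, queries))
```

A large drop from the first score to the second indicates that the unmasked reward could be earned through query wording alone. The toy says nothing about retrieval quality on held-out queries, which is the part only the trained-agent comparison can settle.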

Original abstract

Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CCS adapts cycle consistency to reward search trajectories by reconstruction quality, but the bottlenecks may not fully block lexical leakage from documents.

The main thing to know is that this paper offers a way to train search agents without gold labels by using how well you can reconstruct the original question from the trajectory as the reward signal. They add NER masking to the queries and exclude the final response to push the reconstruction to depend on the actual search results rather than easy word matches. Experiments on QA benchmarks show performance close to supervised RL while beating other unsupervised baselines.

Referee Report

2 major / 1 minor

Summary. The paper introduces Cycle-Consistent Search (CCS), a gold-supervision-free RL framework for training search agents. It posits that an optimal search trajectory acts as a lossless encoding of question intent, enabling a proxy reward via question reconstruction from the trajectory. Information bottlenecks (NER masking of queries and exclusion of the final response) are applied to mitigate lexical leakage, forcing reliance on retrieved observations. Experiments on QA benchmarks claim performance comparable to supervised baselines and superior to prior unsupervised methods.

Significance. If the core hypothesis and bottleneck effectiveness hold, CCS offers a scalable alternative to gold-label-dependent training for search agents, extending cycle-consistency ideas from translation and image domains to RL-based information retrieval. This could reduce reliance on expensive supervision in complex retrieval tasks.

major comments (2)

[Abstract / Method] Abstract and method description: The central claim that NER masking plus final-response exclusion eliminates lexical leakage (allowing reconstruction to depend on search quality rather than superficial cues) is load-bearing for the proxy reward. However, retrieved documents may still contain direct lexical matches or paraphrases from the original question, and no ablation or diagnostic (e.g., reconstruction accuracy on trajectories vs. documents alone) is provided to verify that the bottlenecks succeed.
[Experiments] Experiments section: The claim of 'comparable performance to supervised baselines' and outperformance of priors lacks reported statistical significance, exact metrics, baseline details, or variance across runs. Without these, it is difficult to assess whether the results robustly support the hypothesis that reconstruction quality serves as a faithful proxy for search optimality.

minor comments (1)

[Method] Notation for the cycle-consistency objective and reward formulation should be clarified with explicit equations to distinguish the information-bottlenecked reconstruction from naive cycle loss.
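One way the requested distinction might be written down is shown here. This is a sketch in assumed notation, not the paper's own formulation: the symbols $q$, $\tau$, $y$, $R_\phi$, $m$, and $s$ are introduced only for illustration.

```latex
% Assumed notation (not taken from the paper):
%   q        original question
%   \tau     trajectory of actions a_t (queries) and observations o_t, t = 1..T
%   y        final response
%   R_\phi   reconstruction model
%   m(.)     NER masking applied to a query
%   s(.,.)   similarity between two questions, with values in [0, 1]
\begin{align}
  \hat{q}_{\text{naive}} &= R_\phi\big((a_t, o_t)_{t=1}^{T},\, y\big), &
  r_{\text{naive}}(\tau) &= s\big(q, \hat{q}_{\text{naive}}\big), \\
  \hat{q}_{\text{IB}} &= R_\phi\big((m(a_t), o_t)_{t=1}^{T}\big), &
  r_{\text{IB}}(\tau) &= s\big(q, \hat{q}_{\text{IB}}\big).
\end{align}
```

Under this reading the two rewards differ only in what the reconstruction model is allowed to see: the bottlenecked version withholds $y$ and replaces each query $a_t$ with its masked form $m(a_t)$.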

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address each of the major comments below and outline the revisions we plan to make.

Referee: [Abstract / Method] Abstract and method description: The central claim that NER masking plus final-response exclusion eliminates lexical leakage (allowing reconstruction to depend on search quality rather than superficial cues) is load-bearing for the proxy reward. However, retrieved documents may still contain direct lexical matches or paraphrases from the original question, and no ablation or diagnostic (e.g., reconstruction accuracy on trajectories vs. documents alone) is provided to verify that the bottlenecks succeed.

Authors: We agree that verifying the effectiveness of the proposed information bottlenecks is important for supporting our central hypothesis. The manuscript explains the motivation for applying NER masking to queries and excluding the final response to prevent lexical leakage. However, to provide stronger evidence that reconstruction depends on the retrieved observations rather than superficial cues, we will add an ablation study in the revised version. This will include comparisons of question reconstruction accuracy using complete trajectories versus using only the retrieved documents, as well as with and without the bottlenecks applied. These diagnostics will help confirm that the proxy reward is indeed tied to search quality.

revision: yes

Referee: [Experiments] Experiments section: The claim of 'comparable performance to supervised baselines' and outperformance of priors lacks reported statistical significance, exact metrics, baseline details, or variance across runs. Without these, it is difficult to assess whether the results robustly support the hypothesis that reconstruction quality serves as a faithful proxy for search optimality.

Authors: We appreciate this feedback on the experimental reporting. While the current manuscript states that CCS achieves comparable performance to supervised baselines and outperforms prior unsupervised methods on QA benchmarks, we acknowledge that more detailed statistical analysis is needed. In the revised manuscript, we will include exact performance metrics with standard deviations from multiple runs, full details on the baselines and their configurations, and results of statistical significance tests to robustly support our claims.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines CCS as a reconstruction-based proxy reward with explicit information bottlenecks (NER masking, exclusion of final response) to force dependence on search trajectory content rather than lexical cues. This is presented as a hypothesis, not derived from prior fitted parameters or self-citations. The central claim is evaluated via external QA benchmarks against supervised baselines, with no equations or steps that reduce the reward signal to its own inputs by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear. The method stands as an independent proposal whose validity rests on empirical outcomes rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new proxy reward mechanism based on cycle-consistency with constraints, relying on the domain assumption about search trajectories encoding question intent.

axioms (1)

domain assumption: An optimal search trajectory preserves the information required to reconstruct the original question's intent. This is the central hypothesis stated in the abstract.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1] **Trace Intent**: Identify the unique question that best explains why the agent had to perform these specific search steps

[2] **Evidence-only Grounding**: Every piece of factual information (names, dates, etc.) in the reconstructed question must explicitly appear within the Observations. Do not use your internal pre-trained knowledge to fill in gaps

[3] **Anti-compression**: Do not ignore or simplify away specific modifiers, constraints, or logical relationships to make a question fit partial evidence. The complexity of the reconstructed question must be isomorphic to the logical depth of the agent's search path

[4] **Justify Each Step**: The reconstructed question must fully justify the necessity of every Action and Observation in the trajectory. ### 5. N/A Conditions (Output ONLY "N/A" if any apply)

[5] **Constraint & Tag Mismatch**: The Observations are insufficient to resolve the masked tags ([TAG]) into specific information, or the results contradict or fail to support (unsupported) the unmasked constraints in the Actions

[6] **Under-specification**: The trajectory is too vague to uniquely identify a single original question among multiple plausible possibilities

[7] **Insufficient Evidence**: The search results lack the concrete entities or factual relationships required to satisfy the intent of the Actions or to populate the required slots. ### 6. Output Format - Output ONLY the reconstructed question string or "N/A". - No preface, no explanation, and no concluding remarks. B.4 Baselines RAG We use a standard retrieve...
