pith. machine review for the scientific record.

arxiv: 2601.18296 · v2 · submitted 2026-01-26 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords temporal knowledge graph question answering · reinforcement learning · curriculum learning · autonomous agents · multi-hop temporal reasoning · temporal constraints

The pith

Reverse curriculum reinforcement learning trains an 8B autonomous agent to master complex temporal knowledge graph question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that an end-to-end reinforcement learning agent can solve temporal knowledge graph question answering without relying on fixed workflows or closed-source APIs. It expands the agent's action space with internal actions to reduce cognitive overload during multi-hop reasoning. The central training method is reverse curriculum learning, which begins with the hardest questions containing complex temporal constraints to force development of advanced reasoning before moving to simpler cases. This produces an 8B-parameter model that reaches state-of-the-art results on MultiTQ and TimelineKGQA, with particular gains on difficult questions. If correct, the work shows that careful ordering of training difficulty can create scalable, open autonomous agents for dynamic reasoning tasks.

Core claim

Temp-R1 is the first autonomous end-to-end agent for TKGQA trained through reinforcement learning. It adds specialized internal actions to the action space and applies reverse curriculum learning, training on difficult questions first so that sophisticated multi-hop temporal reasoning develops before skills transfer to easier cases. The resulting 8B-parameter model achieves state-of-the-art performance on MultiTQ and TimelineKGQA.
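
The action vocabulary below follows the paper's figures: <plan>, <think>, <filter>, and <rank> are internal actions executed purely in text space, while <search> is the external action that queries the knowledge graph retriever. The parser itself is a minimal hypothetical sketch of how such a trajectory might be decoded, not the released implementation.

```python
import re
from dataclasses import dataclass

# Tag names taken from the paper's figures; everything else is illustrative.
INTERNAL_ACTIONS = {"plan", "think", "filter", "rank"}  # no retrieval cost
EXTERNAL_ACTIONS = {"search"}                           # hits the KG retriever

@dataclass
class Action:
    name: str       # e.g. "filter", "search"
    content: str    # text between the tags
    internal: bool  # True if the action never leaves the model's text space

TAG_RE = re.compile(r"<(plan|think|filter|rank|search)>(.*?)</\1>", re.DOTALL)

def parse_actions(rollout: str) -> list[Action]:
    """Split a generated trajectory into its tagged actions (hypothetical parser)."""
    return [
        Action(name, body.strip(), internal=name in INTERNAL_ACTIONS)
        for name, body in TAG_RE.findall(rollout)
    ]
```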

What carries the argument

Reverse curriculum reinforcement learning over an expanded action space containing both internal and external actions, which compels the agent to acquire multi-hop temporal reasoning by confronting complex constraints before simpler ones.
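
A minimal sketch of that ordering principle, under two loud assumptions: that difficulty can be scored from question metadata (the field names here are invented), and that stages are cumulative so hard examples stay in the pool. The paper's actual difficulty criterion and schedule may differ.

```python
from collections.abc import Iterator

def difficulty(question: dict) -> int:
    """Proxy difficulty score; the paper's real criterion is not specified here.
    Assumed fields: counts of temporal constraints and reasoning hops."""
    return question["n_temporal_constraints"] + question["n_hops"]

def reverse_curriculum(questions: list[dict], n_stages: int = 3) -> Iterator[list[dict]]:
    """Yield training pools hardest-first: stage 0 contains only the most
    complex questions; later stages mix in progressively easier ones."""
    ordered = sorted(questions, key=difficulty, reverse=True)
    stage_size = max(1, len(ordered) // n_stages)
    for s in range(n_stages):
        hi = len(ordered) if s == n_stages - 1 else (s + 1) * stage_size
        yield ordered[:hi]  # cumulative, so hard examples are never dropped
```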

If this is right

  • Removes dependence on fixed workflows and closed-source APIs for temporal reasoning
  • Delivers measurable gains on multi-hop and temporally constrained questions
  • Allows smaller open models to perform end-to-end dynamic reasoning
  • Demonstrates that curriculum ordering can transfer advanced skills to simpler inputs

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reverse-ordering principle might apply to other multi-step reasoning domains such as mathematical proofs or long-horizon planning
  • Internal actions could become a standard way to increase autonomy across different agent architectures
  • Curriculum design choices like this one may offer a general lever against shortcut learning in reinforcement learning for structured tasks
  • The method leaves open whether the learned reasoning generalizes to entirely new temporal knowledge graphs not seen during training

Load-bearing premise

That beginning training on the most difficult questions will force the development of robust, sophisticated reasoning without causing training instability or poor generalization when the agent later encounters easier cases.

What would settle it

A side-by-side comparison in which an identical 8B model trained with standard or random question ordering shows no gain, or a loss, on complex questions relative to the reverse-curriculum version.
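
One way that settling experiment could be wired up, reusing the difficulty scorer sketched earlier; model_init, train_fn, and eval_fn are placeholders for the real training and evaluation pipeline, and the complexity cutoff is arbitrary.

```python
import random

def ordering_ablation(model_init, train_fn, eval_fn, questions, seed=0):
    """Train identically initialized models that differ only in question
    ordering, then compare accuracy on the complex subset (all hypothetical)."""
    rng = random.Random(seed)
    orderings = {
        "hard_first": sorted(questions, key=difficulty, reverse=True),
        "easy_first": sorted(questions, key=difficulty),
        "random":     rng.sample(questions, len(questions)),
    }
    complex_qs = [q for q in questions if difficulty(q) >= 3]  # arbitrary cutoff
    results = {}
    for name, order in orderings.items():
        model = train_fn(model_init(), order)
        results[name] = eval_fn(model, complex_qs)
    # The reverse-curriculum claim predicts hard_first wins on complex_qs;
    # parity across orderings would undercut it.
    return results
```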

Figures

Figures reproduced from arXiv: 2601.18296 by Huajun Chen, Lei Liang, Songze Li, Wen Zhang, Xiaoke Guo, Xinle Deng, Yuanxiang Liu, Zhaoyan Gong, Zhiqiang Liu, Zhizhen Liu.

Figure 1
Figure 1. Figure 1: Paradigm shift in TKGQA systems. Existing approaches (a) suffer from fixed workflow and high API cost. Temp-R1 (b) is an autonomous end-to-end agent trained for dynamic temporal reasoning. view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of internal reasoning mechanisms: Temp-R1 vs. Search-R1. By decoupling internal reasoning into explicit <filter> and <rank> actions (right), Temp-R1 maintains logical rigor and eliminates hallucinations in temporal sequencing. In contrast, the monolithic <think> block in Search-R1 (left) suffers from cognitive overload, leading to a failure in processing the retrieved temporal facts. view at source ↗
Figure 3
Figure 3. Figure 3: Overall architecture of Temp-R1. The rollout loop features an expanded internal and external action space (Ainternal and Aexternal) designed for structured temporal reasoning. The model is trained via a two-stage process: supervised cold start from high-quality trajectories, followed by GRPO-based reinforcement learning supported by reverse curriculum learning strategies. view at source ↗
Figure 4
Figure 4. Figure 4: Performance comparison of Temp-R1 across … view at source ↗
Figure 7
Figure 7. Figure 7: Training dynamics with/without SFT Cold Start. (a) Training Reward, (b) KL Loss, and (c) Val Accuracy. The blue line shows better stability and higher performance compared to the orange line. view at source ↗
Figure 6
Figure 6. Figure 6: Training dynamics between LLaMA and Qwen Architecture. (a) Training Reward, (b) Response Length, and (c) Val Accuracy. view at source ↗
Figure 9
Figure 9. Figure 9: Autonomous and flexible reasoning trajectories: state transition diagrams across six question types. view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of the average number of actions for Single vs. Multiple tasks across internal action, external action and thinking. view at source ↗
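
Figure 3 states that the RL stage is GRPO-based. As a reminder of the mechanism, and not necessarily the paper's exact configuration: GRPO samples a group of rollouts for the same question and normalizes their rewards within the group, so no learned value network is needed.

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage as in standard GRPO: each rollout's reward is
    standardized against the other rollouts sampled for the same question."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```
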
read the original abstract

Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi-hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed-source APIs, limiting flexibility and scalability. We propose Temp-R1, the first autonomous end-to-end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single-action reasoning, we expand the action space with specialized internal actions alongside external action. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B-parameter Temp-R1 achieves state-of-the-art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. The code is available at https://github.com/zjukg/Temp-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Temp-R1, an 8B-parameter autonomous agent for Temporal Knowledge Graph Question Answering (TKGQA) trained end-to-end via reinforcement learning. It expands the action space with specialized internal actions to mitigate cognitive overload in single-step reasoning and applies reverse curriculum learning by training first on difficult questions to force development of multi-hop temporal reasoning before transferring to easier cases. The central empirical claim is state-of-the-art performance on MultiTQ and TimelineKGQA, with a 19.8% improvement over strong baselines on complex questions; code is released.

Significance. If the performance gains are shown to stem specifically from the reverse curriculum and expanded action space rather than base model scale or implementation details, the work would establish a viable new paradigm for autonomous, API-free temporal reasoning agents. The code release is a positive factor for reproducibility and follow-on research in knowledge-graph question answering.

major comments (3)
  1. [Abstract and Experiments] Abstract and Experiments section: The 19.8% improvement on complex questions is reported without naming the exact baselines, number of runs, or statistical significance tests, leaving the SOTA claim difficult to evaluate.
  2. [Method (Reverse Curriculum)] Method section on reverse curriculum: No ablation comparing reverse ordering to standard or random curriculum ordering is provided, nor are training curves or difficulty-stratified accuracy results shown; this leaves open whether the gains arise from curriculum order, action-space expansion, or the base 8B model.
  3. [Experiments] Experiments section: Reward design, precise action-space implementation details, and full baseline comparisons are not described at a level that would permit reproduction or isolation of the reverse curriculum's contribution. (A hedged sketch of one plausible reward design appears after this report.)
minor comments (1)
  1. [Abstract] Abstract: The phrase 'strong baselines' should be replaced with the specific model names used for the 19.8% comparison.
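
On major comment 3: the reward is indeed unspecified here. Purely as a hypothetical sketch of what an R1-style outcome reward often looks like (the <answer> tag and the 0.1/0.9 weights are assumptions, not the paper's design), reusing the trajectory parser sketched earlier:

```python
def outcome_reward(trajectory: str, gold_answer: str) -> float:
    """Hypothetical rule-based outcome reward, not the paper's actual design:
    a small bonus for a well-formed trajectory, the rest for a correct answer."""
    if "<answer>" not in trajectory or "</answer>" not in trajectory:
        return 0.0  # malformed rollout earns nothing
    well_formed = len(parse_actions(trajectory)) > 0
    answer = trajectory.split("<answer>")[-1].split("</answer>")[0].strip()
    correct = answer.lower() == gold_answer.strip().lower()
    return 0.1 * float(well_formed) + 0.9 * float(correct)
```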

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that clarifying experimental details, adding ablations, and expanding implementation descriptions will strengthen the paper and improve reproducibility. Below we address each major comment point by point, indicating the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The 19.8% improvement on complex questions is reported without naming the exact baselines, number of runs, or statistical significance tests, leaving the SOTA claim difficult to evaluate.

    Authors: We acknowledge that the abstract and experiments section should provide these specifics for proper evaluation. In the revised manuscript we will explicitly name the baselines used for the 19.8% figure, report the number of independent runs performed, and include statistical significance testing (e.g., paired t-tests; a sketch of one such test appears after these responses). These details were present in our internal experimental logs and will be added to the text. revision: yes

  2. Referee: [Method (Reverse Curriculum)] Method section on reverse curriculum: No ablation comparing reverse ordering to standard or random curriculum ordering is provided, nor are training curves or difficulty-stratified accuracy results shown; this leaves open whether the gains arise from curriculum order, action-space expansion, or the base 8B model.

    Authors: We agree that an explicit ablation isolating the reverse curriculum ordering would strengthen the claim. The original submission focused on the motivation and final performance but omitted comparative runs against standard and random curricula. In the revision we will add these ablations, training curves, and difficulty-stratified accuracy tables to demonstrate the specific contribution of the reverse ordering. revision: yes

  3. Referee: [Experiments] Experiments section: Reward design, precise action-space implementation details, and full baseline comparisons are not described at a level that would permit reproduction or isolation of the reverse curriculum's contribution.

    Authors: We note that the full implementation, including reward design and action-space details, is released in the accompanying code repository. However, we accept that the paper itself should be more self-contained. In the revised version we will expand the Experiments section with precise descriptions of the reward function, action-space implementation, and more granular baseline comparisons to allow readers to isolate the reverse curriculum's contribution without needing to consult the code. revision: partial
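
On response 1's promised significance testing: the authors name paired t-tests; a paired bootstrap over per-question correctness is an alternative with fewer distributional assumptions. A minimal sketch:

```python
import random

def paired_bootstrap_p(correct_a: list[int], correct_b: list[int],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Approximate p-value for 'system A outperforms system B', given paired
    0/1 correctness vectors scored on the same evaluation questions."""
    assert correct_a and len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n, not_better = len(correct_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        not_better += diff <= 0
    return not_better / n_resamples  # small value => A's lead is robust
```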

Circularity Check

0 steps flagged

No circularity: empirical RL results on external benchmarks

full rationale

The paper presents an RL-trained agent using reverse curriculum learning and reports performance gains on standard external benchmarks (MultiTQ, TimelineKGQA). No derivation chain exists that reduces predictions or claims to fitted parameters, self-definitions, or self-citation load-bearing steps; the SOTA numbers are measured outcomes rather than quantities forced by construction from the training procedure itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard RL assumptions for agent training and the effectiveness of reverse curriculum ordering; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (2)
  • domain assumption Reinforcement learning with an expanded action space can train an agent to perform multi-hop temporal reasoning without fixed workflows.
    Invoked when claiming the end-to-end autonomous agent succeeds on complex TKGQA.
  • domain assumption Starting training on difficult questions prevents shortcut learning and transfers to easier cases.
    Core justification for the reverse curriculum component.

pith-pipeline@v0.9.0 · 5498 in / 1341 out tokens · 46748 ms · 2026-05-16T11:40:16.697717+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

    cs.LG 2026-04 unverdicted novelty 7.0

    TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    In EMNLP, pages 1867–1877

    Semantic framework based query generation for temporal question answering over knowledge graphs. In EMNLP, pages 1867–1877. Association for Computational Linguistics. Yifu Gao, Linbo Qiao, Zhigang Kan, Zhihua Wen, Yongquan He, and Dongsheng Li. 2024. Two-stage generative question answering on temporal knowledge graph using large language models. In ACL (F...

  2. [2]

    In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6013–6024, Albuquerque, New Mexico

    Time-aware ReAct agent for temporal knowledge graph question answering. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6013–6024, Albuquerque, New Mexico. Association for Computational Linguistics. Zhen Jia, Abdalghani Abujabal, Rishiraj Saha Roy, Jannik Strötgen, and Gerhard Weikum. 2018a. TempQuestions: A benchmark ...

  3. [3]

    Zhen Jia, Abdalghani Abujabal, Rishiraj Saha Roy, Jannik Strötgen, and Gerhard Weikum

    ACM. Zhen Jia, Abdalghani Abujabal, Rishiraj Saha Roy, Jannik Strötgen, and Gerhard Weikum. 2018b. TEQUILA: temporal question answering over knowledge bases. In CIKM, pages 1807–1810. ACM. Zhen Jia, Philipp Christmann, and Gerhard Weikum

  4. [4]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Faithful temporal question answering over heterogeneous sources. In WWW, pages 2052–2063. ACM. Zhen Jia, Soumajit Pramanik, Rishiraj Saha Roy, and Gerhard Weikum. 2021. Complex temporal question answering on knowledge graphs. In CIKM, pages 792–802. ACM. Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. ...

  5. [5]

    Sumit Neelam, Udit Sharma, Hima Karanam, Shajith Ikbal, Pavan Kapanipathi, Ibrahim Abdelaziz, Nandana Mihindukulasooriya, Young-Suk Lee, Santosh K

    AAAI Press. Sumit Neelam, Udit Sharma, Hima Karanam, Shajith Ikbal, Pavan Kapanipathi, Ibrahim Abdelaziz, Nandana Mihindukulasooriya, Young-Suk Lee, Santosh K. Srivastava, Cezar Pendus, Saswati Dana, Dinesh Garg, Achille Fokoue, G. P. Shrivatsa Bhargav, Dinesh Khandelwal, Srinivas Ravishankar, Sairam Gurajada, Maria Chang, Rosario Uceda-Sosa, and 6 others

  6. [6]

    SYGMA: system for generalizable modular question answering over knowledge bases. CoRR, abs/2109.13430. OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allis...

  7. [7]

    Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023. Apoorv Saxena, Soumen Chakrabarti, and Partha P. Talukdar. 2021. Question answering over tempo...

  8. [8]

    Proximal Policy Optimization Algorithms

    Association for Computational Linguistics. Apoorv Saxena, Aditay Tripathi, and Partha P. Talukdar. 2020. Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In ACL, pages 4498–4507. Association for Computational Linguistics. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Pr...

  9. [9]

    Before/After

    adapts the ReAct framework with temporal reasoning toolkits powered by GPT-4-Turbo; MemoTime (Tan et al., 2025) retrieves past solution chains as external memory to enhance reasoning consistency, supporting multiple APIs including GPT-4o-mini, GPT-4o, DeepSeek-V3, and DeepSeek-R1. For fair comparison, we report results using DeepSeek-V3 in our main ...

  10. [10]

    Identify agreements signed by Oman after May 2015

  11. [11]

    Determine which countries were involved. - Answer format: Country name </plan> <search>Oman agreement signed after May 2015</search> <information> Doc 1(Title: Oman Sign_formal_agreement Malaysia on 2015-10-22) Doc 2(Title: Oman Sign_formal_agreement Kuwait on 2015-02-19) Doc 3(Title: Oman Sign_formal_agreement Iran on 2015-05-26) Doc 4(Title: Malaysia Si...

  12. [12]

    When did Médecins Sans Frontières make the request for Iraq?

  13. [13]

    Who made the request before that date?

  14. [14]

    Doctors_Without_Borders

    Identify the most recent request before the one by Médecins Sans Frontières. - Answer format: Entity name </plan> <think>First, I need to find when the request for Iraq by Médecins Sans Frontières was made.</think> <search>Médecins Sans Frontières request Iraq</search> <information> Doc 1(Title: Doctors_Without_Borders Make_an_appeal_or_request Iraq on 20...

  15. [15]

    When did the Chancellor of the Exchequer visit China?

  16. [16]

    in the same month as Christine Lagarde

    Identify the earliest such visit. - Answer format: yyyy </plan> <think>First, I need to find visits to China by the Chancellor of the Exchequer and identify the earliest one.</think> <search>Chancellor of the Exchequer visit China</search> <information> Doc 1(Title: Chancellor_of_the_Exchequer Make_a_visit China on 2005-02-21) Doc 2(Title: Chancellor_of_t...

  17. [17]

    When did Christine Lagarde have relevant negotiations or event with Kuwait?

  18. [18]

    first use after Ethiopia

    Who was Kuwait negotiating with in that same month? - Answer format: Entity name or concise phrase </plan> <think>First, I need to find out when Christine Lagarde had relevant negotiations or event with Kuwait.</think> <search>Christine Lagarde negotiations Kuwait</search> <information> Doc 1(Title: Christine_Lagarde Engage_in_negotiation Kuwait on 2009-0...

  19. [19]

    When did South Sudan use conventional military force against Ethiopia?

  20. [20]

    What other countries has South Sudan used conventional military force against, and when?

  21. [21]

    (omitted 20 Docs) </information> <think> The date when South Sudan used conventional military force against Ethiopia was 2012-01-06

    Among these events, which occurred first after the one against Ethiopia? - Answer format: Country name </plan> <think>First, I need to find when South Sudan used conventional military force against Ethiopia.</think> <search>South Sudan use conventional military force Ethiopia</search> <information> Doc 1(Title: South_Sudan Use_conventional_military_force ...