A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration

Hayden Moore; Mahfuza Farooque; Suman Saha

arxiv: 2605.17911 · v1 · pith:TEMCCLUHnew · submitted 2026-05-18 · 💻 cs.CL

A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration

Hayden Moore , Suman Saha , Mahfuza Farooque This is my paper

Pith reviewed 2026-05-20 11:17 UTC · model grok-4.3

classification 💻 cs.CL

keywords natural language to logicfirst-order logicbenchmark datasetplanetary explorationNASA missionsformal reasoningautonomous agentstemporal structure

0 comments

The pith

This paper introduces a pilot benchmark for translating natural language mission documents from planetary exploration into first-order logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create a benchmark to convert descriptions of space missions from NASA's records into formal logic statements. This matters because future robotic explorers will need to understand and reason about complex mission rules with little human input and under severe communication limits. The dataset draws from real documents covering launches through orbital operations between 2003 and 2013. Annotations supply logic formulas that track time, agent roles, and operational dependencies. The work also supplies predicate vocabularies and typed constants so researchers can test translation systems with different amounts of prior knowledge provided.

Core claim

We introduce a pilot benchmark for translating natural language into First-Order Logic within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System, spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowing

What carries the argument

The manually created dataset that pairs natural language descriptions of planetary missions with first-order logic formulas, together with supplied predicate vocabularies and typed constants.

Load-bearing premise

The manual annotations correctly represent the temporal structure, agent roles, and operational dependencies in the mission documents without introducing significant errors or omissions.

What would settle it

Run an automated reasoner on the provided FOL translations and check whether it can derive known safety constraints or operational rules that are stated or implied in the original NASA documents; consistent failure would show the annotations do not support reliable downstream reasoning.

Figures

Figures reproduced from arXiv: 2605.17911 by Hayden Moore, Mahfuza Farooque, Suman Saha.

read the original abstract

Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a pilot benchmark for translating natural language into First-Order Logic (FOL) for planetary exploration. The dataset is derived from real NASA Planetary Data System (PDS) mission documents spanning 2003 to 2013, which are manually annotated with FOL representations to capture temporal structure, agent roles, and operational dependencies. Structured predicate vocabularies and typed constants are also provided to support experimentation.

Significance. This work has potential significance in bridging natural language processing and formal reasoning for autonomous systems in constrained environments. The use of authentic mission documentation adds practical relevance that synthetic benchmarks lack. However, the absence of quality assurance details for the annotations reduces the immediate utility until addressed.

major comments (2)

[Manual Annotation] No annotation protocol, inter-annotator agreement, or validation against experts is described in the manuscript. This is load-bearing for the claim that the FOL formulas accurately represent the mission documents' temporal and dependency aspects.
[Benchmark Construction] The paper lacks any error analysis or spot-checks for the FOL translations, which is particularly important given the sensitivity of logical representations to quantifier and predicate choices.

minor comments (2)

[Abstract] Include quantitative details such as the number of documents or total FOL formulas in the pilot dataset to better contextualize the scale.
[GitHub Repository] The link is provided, but consider adding a description of the file structure or example entries in the main text for reader convenience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review of our pilot benchmark. The comments correctly identify areas where additional documentation would improve the benchmark's utility and transparency. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Manual Annotation] No annotation protocol, inter-annotator agreement, or validation against experts is described in the manuscript. This is load-bearing for the claim that the FOL formulas accurately represent the mission documents' temporal and dependency aspects.

Authors: We agree that the absence of a described annotation protocol limits the immediate interpretability of the benchmark. The annotations were produced by the lead author following an internal iterative process: drafting FOL formulas from the source documents, cross-referencing against the provided predicate vocabulary and typed constants, and checking for coverage of temporal ordering, agent roles, and operational dependencies. Because this is explicitly a pilot release, no multi-annotator study was conducted. In the revised manuscript we will insert a new subsection that (i) states the annotation guidelines used, (ii) describes the single-annotator workflow and any consistency checks performed, and (iii) explicitly notes the lack of inter-annotator agreement statistics as a limitation of the current pilot, together with plans for future multi-annotator validation. revision: yes
Referee: [Benchmark Construction] The paper lacks any error analysis or spot-checks for the FOL translations, which is particularly important given the sensitivity of logical representations to quantifier and predicate choices.

Authors: We concur that systematic error analysis is valuable for a logic benchmark. During construction we performed informal spot-checks to verify that quantifier scope and predicate selection matched the intended semantics of each mission document excerpt. These checks were not reported. The revised version will contain a dedicated error-analysis subsection that (a) enumerates recurring translation challenges (e.g., temporal scoping, choice between event and state predicates), (b) presents representative examples of both successful and ambiguous cases drawn from the released dataset, and (c) summarizes the outcomes of the internal consistency reviews already performed. This addition will give users a clearer picture of the benchmark's current strengths and limitations. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark is direct dataset construction from external documents

full rationale

The paper introduces a pilot benchmark by manually annotating real NASA PDS mission documents (2003-2013) with FOL representations capturing temporal structure, agent roles, and dependencies. No mathematical derivations, equations, fitted parameters, or predictions are claimed. The central contribution is the dataset itself, provided via GitHub, with predicate vocabularies. This is self-contained data creation grounded in external source material; no step reduces by construction to its own inputs or relies on load-bearing self-citation. The reader's score of 1.0 aligns with the absence of any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes a manually annotated dataset rather than a mathematical derivation; it rests on the domain assumption that mission documentation can be faithfully rendered in FOL and on standard assumptions about first-order logic semantics.

axioms (1)

domain assumption Natural language descriptions of mission phases can be manually translated into first-order logic without critical loss of temporal or operational meaning.
This premise underpins the entire benchmark construction described in the abstract.

pith-pipeline@v0.9.0 · 5766 in / 1256 out tokens · 34422 ms · 2026-05-20T11:17:40.595494+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The pilot benchmark currently consists of 11 multi-phase mission segments... 251 unique predicate signatures

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1]

The EO-1 Autonomous Science Agent,

S. Chien et al., “The EO-1 Autonomous Science Agent,” in Proc. Int. Joint Conf. Autonomous Agents and Multiagent Systems (AAMAS), 2004

work page 2004
[2]

A delay-tolerant network architecture for challenged inter- nets,

K. Fall, “A delay-tolerant network architecture for challenged inter- nets,” in Proc. ACM SIGCOMM, 2003

work page 2003
[3]

Learning to map sentences to logical form,

L. S. Zettlemoyer and M. Collins, “Learning to map sentences to logical form,” in Proc. UAI, 2005

work page 2005
[4]

A Survey on Semantic Parsing

A. Kamath and R. Das, “A survey on semantic parsing,” arXiv preprint arXiv:1812.00978, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[5]

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing,

H. Moore and A. Shah, “Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing,” in AAAI 2026 Bridge LM- Reasoning Workshop

work page 2026
[6]

Evaluating NL2SQL via SQL2NL,

M. Safarzadeh, A. Oroojlooyjadid, and D. Roth, “Evaluating NL2SQL via SQL2NL,” EMNLP, 2025

work page 2025
[7]

FOLIO: Natural Language Reasoning with First-Order Logic,

S. Han, X. Guo, J. Chen, and X. Sun, “FOLIO: Natural Language Reasoning with First-Order Logic,” in Proc. EMNLP, 2024

work page 2024
[8]

Planetary Data System,

NASA Planetary Data System (PDS), “Planetary Data System,” 2024. [Online]. Available: https://pds.nasa.gov/

work page 2024
[9]

A syntactic neural model for general-purpose code generation,

P. Yin and G. Neubig, “A syntactic neural model for general-purpose code generation,” in Proc. ACL, 2017

work page 2017
[10]

Neural-symbolic computing: An effective methodology for principled integration of machine learning and rea- soning,

A. d’Avila Garcez et al., “Neural-symbolic computing: An effective methodology for principled integration of machine learning and rea- soning,” arXiv, 2019

work page 2019
[11]

SQuAD: 100,000+ questions for machine com- prehension of text,

P. Rajpurkar et al., “SQuAD: 100,000+ questions for machine com- prehension of text,” in Proc. EMNLP, 2016

work page 2016
[12]

The Opportunity Rover’s Athena science investigation at Meridiani Planum, Mars,

S. W. Squyres et al., “The Opportunity Rover’s Athena science investigation at Meridiani Planum, Mars,” Science, 2004

work page 2004
[13]

Qwen1.5: Open foundation models,

Qwen Team, “Qwen1.5: Open foundation models,” 2024. [Online]. Available: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat

work page 2024
[14]

Mistral-7B-Instruct-v0.3,

A. Mistral AI, “Mistral-7B-Instruct-v0.3,” 2024. [Online]. Available: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

work page 2024
[15]

Phi-3: Technical report,

Microsoft, “Phi-3: Technical report,” 2024. [Online]. Available: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

work page 2024
[16]

LLaMA 3.1: Open and efficient foundation models,

Meta AI, “LLaMA 3.1: Open and efficient foundation models,”

work page
[17]

Available: https://huggingface.co/meta-llama/Meta- Llama-3.1-8B-Instruct

[Online]. Available: https://huggingface.co/meta-llama/Meta- Llama-3.1-8B-Instruct

work page
[18]

Gemma: Open models based on Gemini research,

Google DeepMind, “Gemma: Open models based on Gemini research,” 2024. [Online]. Available: https://huggingface.co/google/gemma-2-9b-it

work page 2024
[19]

Chain-of-Thought prompting elicits reasoning in large language models,

J. Wei et al., “Chain-of-Thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS), 2022

work page 2022
[20]

Lost in the middle: How language models use long contexts,

N. F. Liu et al., “Lost in the middle: How language models use long contexts,” Trans. ACL, 2024

work page 2024
[21]

A multitask, multilingual, multimodal evaluation of ChatGPT,

Y . Bang et al., “A multitask, multilingual, multimodal evaluation of ChatGPT,” in Proceedings of the 13th International Joint Conference on Natural Language (IJCNLP), 2023

work page 2023

[1] [1]

The EO-1 Autonomous Science Agent,

S. Chien et al., “The EO-1 Autonomous Science Agent,” in Proc. Int. Joint Conf. Autonomous Agents and Multiagent Systems (AAMAS), 2004

work page 2004

[2] [2]

A delay-tolerant network architecture for challenged inter- nets,

K. Fall, “A delay-tolerant network architecture for challenged inter- nets,” in Proc. ACM SIGCOMM, 2003

work page 2003

[3] [3]

Learning to map sentences to logical form,

L. S. Zettlemoyer and M. Collins, “Learning to map sentences to logical form,” in Proc. UAI, 2005

work page 2005

[4] [4]

A Survey on Semantic Parsing

A. Kamath and R. Das, “A survey on semantic parsing,” arXiv preprint arXiv:1812.00978, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing,

H. Moore and A. Shah, “Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing,” in AAAI 2026 Bridge LM- Reasoning Workshop

work page 2026

[6] [6]

Evaluating NL2SQL via SQL2NL,

M. Safarzadeh, A. Oroojlooyjadid, and D. Roth, “Evaluating NL2SQL via SQL2NL,” EMNLP, 2025

work page 2025

[7] [7]

FOLIO: Natural Language Reasoning with First-Order Logic,

S. Han, X. Guo, J. Chen, and X. Sun, “FOLIO: Natural Language Reasoning with First-Order Logic,” in Proc. EMNLP, 2024

work page 2024

[8] [8]

Planetary Data System,

NASA Planetary Data System (PDS), “Planetary Data System,” 2024. [Online]. Available: https://pds.nasa.gov/

work page 2024

[9] [9]

A syntactic neural model for general-purpose code generation,

P. Yin and G. Neubig, “A syntactic neural model for general-purpose code generation,” in Proc. ACL, 2017

work page 2017

[10] [10]

Neural-symbolic computing: An effective methodology for principled integration of machine learning and rea- soning,

A. d’Avila Garcez et al., “Neural-symbolic computing: An effective methodology for principled integration of machine learning and rea- soning,” arXiv, 2019

work page 2019

[11] [11]

SQuAD: 100,000+ questions for machine com- prehension of text,

P. Rajpurkar et al., “SQuAD: 100,000+ questions for machine com- prehension of text,” in Proc. EMNLP, 2016

work page 2016

[12] [12]

The Opportunity Rover’s Athena science investigation at Meridiani Planum, Mars,

S. W. Squyres et al., “The Opportunity Rover’s Athena science investigation at Meridiani Planum, Mars,” Science, 2004

work page 2004

[13] [13]

Qwen1.5: Open foundation models,

Qwen Team, “Qwen1.5: Open foundation models,” 2024. [Online]. Available: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat

work page 2024

[14] [14]

Mistral-7B-Instruct-v0.3,

A. Mistral AI, “Mistral-7B-Instruct-v0.3,” 2024. [Online]. Available: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

work page 2024

[15] [15]

Phi-3: Technical report,

Microsoft, “Phi-3: Technical report,” 2024. [Online]. Available: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

work page 2024

[16] [16]

LLaMA 3.1: Open and efficient foundation models,

Meta AI, “LLaMA 3.1: Open and efficient foundation models,”

work page

[17] [17]

Available: https://huggingface.co/meta-llama/Meta- Llama-3.1-8B-Instruct

[Online]. Available: https://huggingface.co/meta-llama/Meta- Llama-3.1-8B-Instruct

work page

[18] [18]

Gemma: Open models based on Gemini research,

Google DeepMind, “Gemma: Open models based on Gemini research,” 2024. [Online]. Available: https://huggingface.co/google/gemma-2-9b-it

work page 2024

[19] [19]

Chain-of-Thought prompting elicits reasoning in large language models,

J. Wei et al., “Chain-of-Thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS), 2022

work page 2022

[20] [20]

Lost in the middle: How language models use long contexts,

N. F. Liu et al., “Lost in the middle: How language models use long contexts,” Trans. ACL, 2024

work page 2024

[21] [21]

A multitask, multilingual, multimodal evaluation of ChatGPT,

Y . Bang et al., “A multitask, multilingual, multimodal evaluation of ChatGPT,” in Proceedings of the 13th International Joint Conference on Natural Language (IJCNLP), 2023

work page 2023