A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration
Pith reviewed 2026-05-20 11:17 UTC · model grok-4.3
The pith
This paper introduces a pilot benchmark for translating natural language mission documents from planetary exploration into first-order logic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a pilot benchmark for translating natural language into First-Order Logic within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System, spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowing
What carries the argument
The manually created dataset that pairs natural language descriptions of planetary missions with first-order logic formulas, together with supplied predicate vocabularies and typed constants.
Load-bearing premise
The manual annotations correctly represent the temporal structure, agent roles, and operational dependencies in the mission documents without introducing significant errors or omissions.
What would settle it
Run an automated reasoner on the provided FOL translations and check whether it can derive known safety constraints or operational rules that are stated or implied in the original NASA documents; consistent failure would show the annotations do not support reliable downstream reasoning.
Figures
read the original abstract
Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a pilot benchmark for translating natural language into First-Order Logic (FOL) for planetary exploration. The dataset is derived from real NASA Planetary Data System (PDS) mission documents spanning 2003 to 2013, which are manually annotated with FOL representations to capture temporal structure, agent roles, and operational dependencies. Structured predicate vocabularies and typed constants are also provided to support experimentation.
Significance. This work has potential significance in bridging natural language processing and formal reasoning for autonomous systems in constrained environments. The use of authentic mission documentation adds practical relevance that synthetic benchmarks lack. However, the absence of quality assurance details for the annotations reduces the immediate utility until addressed.
major comments (2)
- [Manual Annotation] No annotation protocol, inter-annotator agreement, or validation against experts is described in the manuscript. This is load-bearing for the claim that the FOL formulas accurately represent the mission documents' temporal and dependency aspects.
- [Benchmark Construction] The paper lacks any error analysis or spot-checks for the FOL translations, which is particularly important given the sensitivity of logical representations to quantifier and predicate choices.
minor comments (2)
- [Abstract] Include quantitative details such as the number of documents or total FOL formulas in the pilot dataset to better contextualize the scale.
- [GitHub Repository] The link is provided, but consider adding a description of the file structure or example entries in the main text for reader convenience.
Simulated Author's Rebuttal
We thank the referee for their constructive review of our pilot benchmark. The comments correctly identify areas where additional documentation would improve the benchmark's utility and transparency. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Manual Annotation] No annotation protocol, inter-annotator agreement, or validation against experts is described in the manuscript. This is load-bearing for the claim that the FOL formulas accurately represent the mission documents' temporal and dependency aspects.
Authors: We agree that the absence of a described annotation protocol limits the immediate interpretability of the benchmark. The annotations were produced by the lead author following an internal iterative process: drafting FOL formulas from the source documents, cross-referencing against the provided predicate vocabulary and typed constants, and checking for coverage of temporal ordering, agent roles, and operational dependencies. Because this is explicitly a pilot release, no multi-annotator study was conducted. In the revised manuscript we will insert a new subsection that (i) states the annotation guidelines used, (ii) describes the single-annotator workflow and any consistency checks performed, and (iii) explicitly notes the lack of inter-annotator agreement statistics as a limitation of the current pilot, together with plans for future multi-annotator validation. revision: yes
-
Referee: [Benchmark Construction] The paper lacks any error analysis or spot-checks for the FOL translations, which is particularly important given the sensitivity of logical representations to quantifier and predicate choices.
Authors: We concur that systematic error analysis is valuable for a logic benchmark. During construction we performed informal spot-checks to verify that quantifier scope and predicate selection matched the intended semantics of each mission document excerpt. These checks were not reported. The revised version will contain a dedicated error-analysis subsection that (a) enumerates recurring translation challenges (e.g., temporal scoping, choice between event and state predicates), (b) presents representative examples of both successful and ambiguous cases drawn from the released dataset, and (c) summarizes the outcomes of the internal consistency reviews already performed. This addition will give users a clearer picture of the benchmark's current strengths and limitations. revision: yes
Circularity Check
No circularity: benchmark is direct dataset construction from external documents
full rationale
The paper introduces a pilot benchmark by manually annotating real NASA PDS mission documents (2003-2013) with FOL representations capturing temporal structure, agent roles, and dependencies. No mathematical derivations, equations, fitted parameters, or predictions are claimed. The central contribution is the dataset itself, provided via GitHub, with predicate vocabularies. This is self-contained data creation grounded in external source material; no step reduces by construction to its own inputs or relies on load-bearing self-citation. The reader's score of 1.0 aligns with the absence of any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Natural language descriptions of mission phases can be manually translated into first-order logic without critical loss of temporal or operational meaning.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The pilot benchmark currently consists of 11 multi-phase mission segments... 251 unique predicate signatures
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The EO-1 Autonomous Science Agent,
S. Chien et al., “The EO-1 Autonomous Science Agent,” in Proc. Int. Joint Conf. Autonomous Agents and Multiagent Systems (AAMAS), 2004
work page 2004
-
[2]
A delay-tolerant network architecture for challenged inter- nets,
K. Fall, “A delay-tolerant network architecture for challenged inter- nets,” in Proc. ACM SIGCOMM, 2003
work page 2003
-
[3]
Learning to map sentences to logical form,
L. S. Zettlemoyer and M. Collins, “Learning to map sentences to logical form,” in Proc. UAI, 2005
work page 2005
-
[4]
A. Kamath and R. Das, “A survey on semantic parsing,” arXiv preprint arXiv:1812.00978, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[5]
Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing,
H. Moore and A. Shah, “Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing,” in AAAI 2026 Bridge LM- Reasoning Workshop
work page 2026
-
[6]
M. Safarzadeh, A. Oroojlooyjadid, and D. Roth, “Evaluating NL2SQL via SQL2NL,” EMNLP, 2025
work page 2025
-
[7]
FOLIO: Natural Language Reasoning with First-Order Logic,
S. Han, X. Guo, J. Chen, and X. Sun, “FOLIO: Natural Language Reasoning with First-Order Logic,” in Proc. EMNLP, 2024
work page 2024
-
[8]
NASA Planetary Data System (PDS), “Planetary Data System,” 2024. [Online]. Available: https://pds.nasa.gov/
work page 2024
-
[9]
A syntactic neural model for general-purpose code generation,
P. Yin and G. Neubig, “A syntactic neural model for general-purpose code generation,” in Proc. ACL, 2017
work page 2017
-
[10]
A. d’Avila Garcez et al., “Neural-symbolic computing: An effective methodology for principled integration of machine learning and rea- soning,” arXiv, 2019
work page 2019
-
[11]
SQuAD: 100,000+ questions for machine com- prehension of text,
P. Rajpurkar et al., “SQuAD: 100,000+ questions for machine com- prehension of text,” in Proc. EMNLP, 2016
work page 2016
-
[12]
The Opportunity Rover’s Athena science investigation at Meridiani Planum, Mars,
S. W. Squyres et al., “The Opportunity Rover’s Athena science investigation at Meridiani Planum, Mars,” Science, 2004
work page 2004
-
[13]
Qwen1.5: Open foundation models,
Qwen Team, “Qwen1.5: Open foundation models,” 2024. [Online]. Available: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat
work page 2024
-
[14]
A. Mistral AI, “Mistral-7B-Instruct-v0.3,” 2024. [Online]. Available: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
work page 2024
-
[15]
Microsoft, “Phi-3: Technical report,” 2024. [Online]. Available: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
work page 2024
-
[16]
LLaMA 3.1: Open and efficient foundation models,
Meta AI, “LLaMA 3.1: Open and efficient foundation models,”
-
[17]
Available: https://huggingface.co/meta-llama/Meta- Llama-3.1-8B-Instruct
[Online]. Available: https://huggingface.co/meta-llama/Meta- Llama-3.1-8B-Instruct
-
[18]
Gemma: Open models based on Gemini research,
Google DeepMind, “Gemma: Open models based on Gemini research,” 2024. [Online]. Available: https://huggingface.co/google/gemma-2-9b-it
work page 2024
-
[19]
Chain-of-Thought prompting elicits reasoning in large language models,
J. Wei et al., “Chain-of-Thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS), 2022
work page 2022
-
[20]
Lost in the middle: How language models use long contexts,
N. F. Liu et al., “Lost in the middle: How language models use long contexts,” Trans. ACL, 2024
work page 2024
-
[21]
A multitask, multilingual, multimodal evaluation of ChatGPT,
Y . Bang et al., “A multitask, multilingual, multimodal evaluation of ChatGPT,” in Proceedings of the 13th International Joint Conference on Natural Language (IJCNLP), 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.