pith. sign in

arxiv: 2605.17911 · v1 · pith:TEMCCLUHnew · submitted 2026-05-18 · 💻 cs.CL

A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration

Pith reviewed 2026-05-20 11:17 UTC · model grok-4.3

classification 💻 cs.CL
keywords natural language to logicfirst-order logicbenchmark datasetplanetary explorationNASA missionsformal reasoningautonomous agentstemporal structure
0
0 comments X

The pith

This paper introduces a pilot benchmark for translating natural language mission documents from planetary exploration into first-order logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create a benchmark to convert descriptions of space missions from NASA's records into formal logic statements. This matters because future robotic explorers will need to understand and reason about complex mission rules with little human input and under severe communication limits. The dataset draws from real documents covering launches through orbital operations between 2003 and 2013. Annotations supply logic formulas that track time, agent roles, and operational dependencies. The work also supplies predicate vocabularies and typed constants so researchers can test translation systems with different amounts of prior knowledge provided.

Core claim

We introduce a pilot benchmark for translating natural language into First-Order Logic within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System, spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowing

What carries the argument

The manually created dataset that pairs natural language descriptions of planetary missions with first-order logic formulas, together with supplied predicate vocabularies and typed constants.

Load-bearing premise

The manual annotations correctly represent the temporal structure, agent roles, and operational dependencies in the mission documents without introducing significant errors or omissions.

What would settle it

Run an automated reasoner on the provided FOL translations and check whether it can derive known safety constraints or operational rules that are stated or implied in the original NASA documents; consistent failure would show the annotations do not support reliable downstream reasoning.

Figures

Figures reproduced from arXiv: 2605.17911 by Hayden Moore, Mahfuza Farooque, Suman Saha.

Figure 1
Figure 1. Figure 1: Conceptual setting motivating this work. In long-duration, [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
read the original abstract

Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a pilot benchmark for translating natural language into First-Order Logic (FOL) for planetary exploration. The dataset is derived from real NASA Planetary Data System (PDS) mission documents spanning 2003 to 2013, which are manually annotated with FOL representations to capture temporal structure, agent roles, and operational dependencies. Structured predicate vocabularies and typed constants are also provided to support experimentation.

Significance. This work has potential significance in bridging natural language processing and formal reasoning for autonomous systems in constrained environments. The use of authentic mission documentation adds practical relevance that synthetic benchmarks lack. However, the absence of quality assurance details for the annotations reduces the immediate utility until addressed.

major comments (2)
  1. [Manual Annotation] No annotation protocol, inter-annotator agreement, or validation against experts is described in the manuscript. This is load-bearing for the claim that the FOL formulas accurately represent the mission documents' temporal and dependency aspects.
  2. [Benchmark Construction] The paper lacks any error analysis or spot-checks for the FOL translations, which is particularly important given the sensitivity of logical representations to quantifier and predicate choices.
minor comments (2)
  1. [Abstract] Include quantitative details such as the number of documents or total FOL formulas in the pilot dataset to better contextualize the scale.
  2. [GitHub Repository] The link is provided, but consider adding a description of the file structure or example entries in the main text for reader convenience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review of our pilot benchmark. The comments correctly identify areas where additional documentation would improve the benchmark's utility and transparency. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Manual Annotation] No annotation protocol, inter-annotator agreement, or validation against experts is described in the manuscript. This is load-bearing for the claim that the FOL formulas accurately represent the mission documents' temporal and dependency aspects.

    Authors: We agree that the absence of a described annotation protocol limits the immediate interpretability of the benchmark. The annotations were produced by the lead author following an internal iterative process: drafting FOL formulas from the source documents, cross-referencing against the provided predicate vocabulary and typed constants, and checking for coverage of temporal ordering, agent roles, and operational dependencies. Because this is explicitly a pilot release, no multi-annotator study was conducted. In the revised manuscript we will insert a new subsection that (i) states the annotation guidelines used, (ii) describes the single-annotator workflow and any consistency checks performed, and (iii) explicitly notes the lack of inter-annotator agreement statistics as a limitation of the current pilot, together with plans for future multi-annotator validation. revision: yes

  2. Referee: [Benchmark Construction] The paper lacks any error analysis or spot-checks for the FOL translations, which is particularly important given the sensitivity of logical representations to quantifier and predicate choices.

    Authors: We concur that systematic error analysis is valuable for a logic benchmark. During construction we performed informal spot-checks to verify that quantifier scope and predicate selection matched the intended semantics of each mission document excerpt. These checks were not reported. The revised version will contain a dedicated error-analysis subsection that (a) enumerates recurring translation challenges (e.g., temporal scoping, choice between event and state predicates), (b) presents representative examples of both successful and ambiguous cases drawn from the released dataset, and (c) summarizes the outcomes of the internal consistency reviews already performed. This addition will give users a clearer picture of the benchmark's current strengths and limitations. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark is direct dataset construction from external documents

full rationale

The paper introduces a pilot benchmark by manually annotating real NASA PDS mission documents (2003-2013) with FOL representations capturing temporal structure, agent roles, and dependencies. No mathematical derivations, equations, fitted parameters, or predictions are claimed. The central contribution is the dataset itself, provided via GitHub, with predicate vocabularies. This is self-contained data creation grounded in external source material; no step reduces by construction to its own inputs or relies on load-bearing self-citation. The reader's score of 1.0 aligns with the absence of any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes a manually annotated dataset rather than a mathematical derivation; it rests on the domain assumption that mission documentation can be faithfully rendered in FOL and on standard assumptions about first-order logic semantics.

axioms (1)
  • domain assumption Natural language descriptions of mission phases can be manually translated into first-order logic without critical loss of temporal or operational meaning.
    This premise underpins the entire benchmark construction described in the abstract.

pith-pipeline@v0.9.0 · 5766 in / 1256 out tokens · 34422 ms · 2026-05-20T11:17:40.595494+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    The EO-1 Autonomous Science Agent,

    S. Chien et al., “The EO-1 Autonomous Science Agent,” in Proc. Int. Joint Conf. Autonomous Agents and Multiagent Systems (AAMAS), 2004

  2. [2]

    A delay-tolerant network architecture for challenged inter- nets,

    K. Fall, “A delay-tolerant network architecture for challenged inter- nets,” in Proc. ACM SIGCOMM, 2003

  3. [3]

    Learning to map sentences to logical form,

    L. S. Zettlemoyer and M. Collins, “Learning to map sentences to logical form,” in Proc. UAI, 2005

  4. [4]

    A Survey on Semantic Parsing

    A. Kamath and R. Das, “A survey on semantic parsing,” arXiv preprint arXiv:1812.00978, 2018

  5. [5]

    Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing,

    H. Moore and A. Shah, “Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing,” in AAAI 2026 Bridge LM- Reasoning Workshop

  6. [6]

    Evaluating NL2SQL via SQL2NL,

    M. Safarzadeh, A. Oroojlooyjadid, and D. Roth, “Evaluating NL2SQL via SQL2NL,” EMNLP, 2025

  7. [7]

    FOLIO: Natural Language Reasoning with First-Order Logic,

    S. Han, X. Guo, J. Chen, and X. Sun, “FOLIO: Natural Language Reasoning with First-Order Logic,” in Proc. EMNLP, 2024

  8. [8]

    Planetary Data System,

    NASA Planetary Data System (PDS), “Planetary Data System,” 2024. [Online]. Available: https://pds.nasa.gov/

  9. [9]

    A syntactic neural model for general-purpose code generation,

    P. Yin and G. Neubig, “A syntactic neural model for general-purpose code generation,” in Proc. ACL, 2017

  10. [10]

    Neural-symbolic computing: An effective methodology for principled integration of machine learning and rea- soning,

    A. d’Avila Garcez et al., “Neural-symbolic computing: An effective methodology for principled integration of machine learning and rea- soning,” arXiv, 2019

  11. [11]

    SQuAD: 100,000+ questions for machine com- prehension of text,

    P. Rajpurkar et al., “SQuAD: 100,000+ questions for machine com- prehension of text,” in Proc. EMNLP, 2016

  12. [12]

    The Opportunity Rover’s Athena science investigation at Meridiani Planum, Mars,

    S. W. Squyres et al., “The Opportunity Rover’s Athena science investigation at Meridiani Planum, Mars,” Science, 2004

  13. [13]

    Qwen1.5: Open foundation models,

    Qwen Team, “Qwen1.5: Open foundation models,” 2024. [Online]. Available: https://huggingface.co/Qwen/Qwen1.5-1.8B-Chat

  14. [14]

    Mistral-7B-Instruct-v0.3,

    A. Mistral AI, “Mistral-7B-Instruct-v0.3,” 2024. [Online]. Available: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

  15. [15]

    Phi-3: Technical report,

    Microsoft, “Phi-3: Technical report,” 2024. [Online]. Available: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

  16. [16]

    LLaMA 3.1: Open and efficient foundation models,

    Meta AI, “LLaMA 3.1: Open and efficient foundation models,”

  17. [17]

    Available: https://huggingface.co/meta-llama/Meta- Llama-3.1-8B-Instruct

    [Online]. Available: https://huggingface.co/meta-llama/Meta- Llama-3.1-8B-Instruct

  18. [18]

    Gemma: Open models based on Gemini research,

    Google DeepMind, “Gemma: Open models based on Gemini research,” 2024. [Online]. Available: https://huggingface.co/google/gemma-2-9b-it

  19. [19]

    Chain-of-Thought prompting elicits reasoning in large language models,

    J. Wei et al., “Chain-of-Thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS), 2022

  20. [20]

    Lost in the middle: How language models use long contexts,

    N. F. Liu et al., “Lost in the middle: How language models use long contexts,” Trans. ACL, 2024

  21. [21]

    A multitask, multilingual, multimodal evaluation of ChatGPT,

    Y . Bang et al., “A multitask, multilingual, multimodal evaluation of ChatGPT,” in Proceedings of the 13th International Joint Conference on Natural Language (IJCNLP), 2023