pith. sign in

arxiv: 2605.22923 · v1 · pith:6T7WBBGSnew · submitted 2026-05-21 · 💻 cs.IR · cs.CL

AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation

Pith reviewed 2026-05-25 02:09 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords LaTeX preprocessingRetrieval-augmented generationRAGMathematical documentsTechnical documentsDocument chunkingKnowledge extractionCross-reference resolution
0
0 comments X

The pith

LaTeX source, after resolving references and expanding macros, supplies a richer knowledge base for retrieval-augmented generation on mathematical and technical documents than PDF text does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LaTeX source files preserve structural labels, sectioning commands, custom macros, and authorial intent that PDF extraction routinely loses or distorts. It shows how a targeted preprocessing step can convert that source, together with auxiliary files, into Markdown and JSONL chunks ready for vector indexing. A sympathetic reader would care because better-grounded retrieval should produce more reliable answers from language models when they handle textbooks, lecture notes, or exercises. The method treats the original source as the primary artifact rather than the rendered PDF.

Core claim

The central claim is that LaTeX source can serve as a superior starting point for RAG on mathematical and technical material once cross-references are resolved, custom macros are interpreted, exercises are identified, and the results are emitted as Markdown and JSONL chunks suitable for indexing; this preserves information that PDF extraction typically discards.

What carries the argument

A focused preprocessing pipeline that ingests LaTeX source plus its compiled auxiliary files and optional author annotations to emit Markdown and JSONL chunks.

If this is right

  • Retrieval systems can index resolved cross-references directly, allowing answers to point accurately to labeled equations or theorems.
  • Custom macros expand into readable text, so definitions and notation remain consistent across retrieved fragments.
  • Exercises and examples become explicitly tagged, enabling targeted retrieval for practice problems rather than narrative text.
  • Author annotations, when present, add semantic metadata that improves chunk relevance without requiring post-hoc parsing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be wrapped around existing LaTeX build systems so that every compiled PDF is accompanied by an up-to-date set of AI-ready chunks.
  • Textbooks already distributed only as source could be turned into public RAG indexes with minimal additional author effort.
  • The approach suggests a general principle: any markup language that encodes explicit structure may outperform its rendered form as a retrieval source once preprocessed.

Load-bearing premise

Cross-references can be resolved, custom macros interpreted, and exercises identified reliably through preprocessing without substantial information loss or heavy reliance on author-supplied metadata.

What would settle it

Run the same set of technical questions against a RAG system once using preprocessed LaTeX chunks and once using PDF-extracted text from an identical textbook; measure the fraction of answers that correctly cite the right sections and resolve references.

Figures

Figures reproduced from arXiv: 2605.22923 by Tom Verhoeff.

Figure 1
Figure 1. Figure 1: Major stages of the LATEX RAG preprocessing pipeline. Input and output files are shown with a gray background. 1. Read the main LATEX file. 2. Recursively follow \input and \include commands. 3. Read the main .aux file and recursively follow \@input entries. 4. Parse \newlabel entries to build a label table. 5. Load optional YAML annotations. 6. Convert selected LaTeX structures to Markdown. 7. Resolve \re… view at source ↗
Figure 2
Figure 2. Figure 2: Commuting diagram for the split combinator. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Transformation of a LATEX figure environment (left) into a RAG chunk (right). The \AIDescription block (shaded, left) is a no-op during typesetting but becomes the semantic description in the embedding text (shaded, right). \begin{figure} ... \caption{Commuting diagram for ...} \label{fig:split-diagram} \AIDescription{ The figure shows two functions f:X to A and g:X to B combined into a single function fro… view at source ↗
read the original abstract

Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model context before answering. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a preprocessing pipeline that transforms LaTeX source files (including auxiliaries and annotations) into Markdown and JSONL chunks for use in RAG systems. It posits that LaTeX preserves more structural and semantic information than PDF for mathematical and technical documents.

Significance. If validated, this approach could meaningfully improve the grounding of LLM responses for technical content by leveraging LaTeX's inherent structure. The paper's strength lies in its explicit scoping as an engineering contribution rather than an untested theoretical claim.

major comments (1)
  1. [Abstract] Abstract: The assertion that LaTeX source 'can be a better starting point than a PDF' because it contains structural information 'often lost or distorted in PDF extraction' is presented without any empirical results, comparative metrics, error analysis, or case studies to support the superiority claim or demonstrate the pipeline's effectiveness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for careful scoping in the abstract. We agree that the current wording asserts a comparative advantage without supporting evidence and will revise the abstract accordingly to align with the paper's engineering focus.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that LaTeX source 'can be a better starting point than a PDF' because it contains structural information 'often lost or distorted in PDF extraction' is presented without any empirical results, comparative metrics, error analysis, or case studies to support the superiority claim or demonstrate the pipeline's effectiveness.

    Authors: We acknowledge that the abstract presents an unsupported comparative claim. The manuscript is explicitly positioned as an engineering contribution describing a preprocessing pipeline rather than an empirical evaluation. We will revise the abstract to remove the assertion of superiority and instead frame the structural advantages of LaTeX as a motivating rationale for the pipeline, without implying proven outperformance over PDF extraction. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a methodological description of a LaTeX preprocessing pipeline for RAG without any derivations, equations, fitted parameters, predictions, or self-citations that serve as load-bearing premises. The central claim—that LaTeX preserves structural information better than PDF extraction—is an engineering observation, not a result derived from prior inputs or self-referential definitions. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a methods paper describing a software preprocessing pipeline with no mathematical claims, fitted parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5692 in / 959 out tokens · 19886 ms · 2026-05-25T02:09:53.517847+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

  1. [1]

    Lewis, E

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.Yih, T.Rocktäschel, S.Riedel, andD.Kiela.Retrieval-augmentedgenerationforknowledge- intensive NLP tasks. InAdvances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020

  2. [2]

    MacFarlane

    J. MacFarlane. Pandoc: A universal document converter.https://pandoc.org, 2006

  3. [3]

    B. R. Miller. LaTeXML: A LATEX to XML/HTML/MathML converter.https://math.nist. gov/~BMiller/LaTeXML/, 2010

  4. [4]

    Verhoeff.latex-rag-preprocessor: Preprocessing LATEX source for retrieval-augmented generation.https://gitlab.tue.nl/t-verhoeff-software/latex-rag-preprocessor, 2026

    T. Verhoeff.latex-rag-preprocessor: Preprocessing LATEX source for retrieval-augmented generation.https://gitlab.tue.nl/t-verhoeff-software/latex-rag-preprocessor, 2026

  5. [5]

    ChromaDB: The AI-native open-source embedding database.https://www

    Chroma. ChromaDB: The AI-native open-source embedding database.https://www. trychroma.com, 2022

  6. [6]

    exercise

    Anthropic. Claude Code: An agentic coding tool.https://claude.ai/code, 2025. A Author Instructions for AI-Friendly LATEX This appendix provides practical guidance for authors who want their LATEX source to work well with the preprocessor. The instructions below cover the in-source macro approach. For the external YAML approach and a discussion of when to ...

  7. [7]

    Readchunks.jsonlline by line

  8. [8]

    For each chunk, send theembedding_textfield to an embedding model to obtain a vector

  9. [9]

    Lyapunov stability

    Store thevector together withthe chunk’sid,kind,markdown,heading_path, and remaining metadata in the vector database. Therepositoryincludesaready-to-runscriptforChromaDB[5]usinglocalsentence-transformer embeddings (no API key required): pip install chromadb sentence-transformers python3 examples/ingest_chromadb.py rag_out/chunks.jsonl The script is idempo...