AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation

Tom Verhoeff

arxiv: 2605.22923 · v1 · pith:6T7WBBGSnew · submitted 2026-05-21 · 💻 cs.IR · cs.CL

Pith reviewed 2026-05-25 02:09 UTC · model grok-4.3

keywords: LaTeX preprocessing · Retrieval-augmented generation · RAG · Mathematical documents · Technical documents · Document chunking · Knowledge extraction · Cross-reference resolution

The pith

LaTeX source, after resolving references and expanding macros, supplies a richer knowledge base for retrieval-augmented generation on mathematical and technical documents than PDF text does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LaTeX source files preserve structural labels, sectioning commands, custom macros, and authorial intent that PDF extraction routinely loses or distorts. It shows how a targeted preprocessing step can convert that source, together with auxiliary files, into Markdown and JSONL chunks ready for vector indexing. A sympathetic reader would care because better-grounded retrieval should produce more reliable answers from language models when they handle textbooks, lecture notes, or exercises. The method treats the original source as the primary artifact rather than the rendered PDF.

Core claim

The central claim is that LaTeX source can serve as a superior starting point for RAG on mathematical and technical material once cross-references are resolved, custom macros are interpreted, exercises are identified, and the results are emitted as Markdown and JSONL chunks suitable for indexing; this preserves information that PDF extraction typically discards.

What carries the argument

A focused preprocessing pipeline that ingests LaTeX source plus its compiled auxiliary files and optional author annotations to emit Markdown and JSONL chunks.
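To make the output side of the pipeline concrete, here is a minimal sketch of emitting chunks as JSONL, one JSON object per line. The field names `id`, `kind`, `markdown`, `heading_path`, and `embedding_text` are the ones mentioned in the paper's ingestion notes; the helper `make_chunk`, the way `embedding_text` is composed from the heading path, and the sample values are assumptions made for illustration, not the preprocessor's actual schema or code.

```python
import io
import json

def make_chunk(chunk_id, kind, heading_path, markdown):
    """Build one chunk record (hypothetical helper; the field layout is assumed)."""
    # Assumption: prefix the heading path so the embedding carries section context.
    embedding_text = " > ".join(heading_path) + "\n" + markdown
    return {
        "id": chunk_id,
        "kind": kind,
        "heading_path": heading_path,
        "markdown": markdown,
        "embedding_text": embedding_text,
    }

def write_jsonl(chunks, stream):
    """Write one JSON object per line; ensure_ascii=False keeps symbols readable."""
    for chunk in chunks:
        stream.write(json.dumps(chunk, ensure_ascii=False) + "\n")

# Illustrative sample chunks (invented content).
chunks = [
    make_chunk("sec-1-def-1", "definition", ["Combinators", "Split"],
               "The split of $f$ and $g$ maps $x$ to $(f(x), g(x))$."),
    make_chunk("sec-1-ex-1", "exercise", ["Combinators", "Split"],
               "Show that the diagram for split commutes."),
]

buffer = io.StringIO()
write_jsonl(chunks, buffer)
jsonl_lines = buffer.getvalue().splitlines()
```

Writing to a stream rather than a fixed path keeps the sketch testable in memory; a real run would pass an open file instead of the `StringIO` buffer.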

If this is right

Retrieval systems can index resolved cross-references directly, allowing answers to point accurately to labeled equations or theorems.
Custom macros expand into readable text, so definitions and notation remain consistent across retrieved fragments.
Exercises and examples become explicitly tagged, enabling targeted retrieval for practice problems rather than narrative text.
Author annotations, when present, add semantic metadata that improves chunk relevance without requiring post-hoc parsing.
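The first two consequences can be illustrated with a small sketch that resolves `\ref` and `\eqref` against a label table and expands zero-argument macros. The `LABELS` and `MACROS` tables, the function names, and the sample sentence are hypothetical; the paper's preprocessor may handle macros with arguments, `\cref`, and other cases that this sketch ignores.

```python
import re

# Hypothetical label table, e.g. as built from \newlabel entries in the .aux file.
LABELS = {"eq:split": "2.3", "thm:fusion": "4"}

# Hypothetical zero-argument author macros mapped to readable replacements.
MACROS = {r"\R": r"\mathbb{R}", r"\Nat": r"\mathbb{N}"}

def resolve_refs(text, labels):
    """Replace \\ref{key} and \\eqref{key} with the printed number, if known."""
    def repl(match):
        command, key = match.group(1), match.group(2)
        number = labels.get(key)
        if number is None:
            return match.group(0)  # leave unresolved references untouched
        return f"({number})" if command == "eqref" else number
    return re.sub(r"\\(ref|eqref)\{([^}]*)\}", repl, text)

def expand_macros(text, macros):
    """Expand zero-argument macros; the lookahead keeps \\R from matching \\Re."""
    for name in sorted(macros, key=len, reverse=True):
        pattern = re.escape(name) + r"(?![A-Za-z])"
        # A function replacement inserts the text verbatim (no backslash escapes).
        text = re.sub(pattern, lambda m, r=macros[name]: r, text)
    return text

source = (r"By Theorem~\ref{thm:fusion} and \eqref{eq:split}, "
          r"$f : \R \to \R$; see also \ref{sec:missing}.")
resolved = expand_macros(resolve_refs(source, LABELS), MACROS)
```

Unknown labels are deliberately left in place so that a missing `.aux` entry stays visible in the output instead of silently disappearing.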

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be wrapped around existing LaTeX build systems so that every compiled PDF is accompanied by an up-to-date set of AI-ready chunks.
Textbooks already distributed only as source could be turned into public RAG indexes with minimal additional author effort.
The approach suggests a general principle: any markup language that encodes explicit structure may outperform its rendered form as a retrieval source once preprocessed.

Load-bearing premise

Cross-references can be resolved, custom macros interpreted, and exercises identified reliably through preprocessing without substantial information loss or heavy reliance on author-supplied metadata.

What would settle it

Run the same set of technical questions against a RAG system once using preprocessed LaTeX chunks and once using PDF-extracted text from an identical textbook; measure the fraction of answers that correctly cite the right sections and resolve references.
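The proposed measurement can be sketched as a simple scoring function. Everything here is an assumption for illustration: the `Answer` record, the rule that an answer counts as correct when its citations include every expected label, and the toy inputs, which are invented to exercise the code and are not results of any experiment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Answer:
    question_id: str
    cited: frozenset  # labels or section ids the generated answer cites

def citation_accuracy(answers, gold):
    """Fraction of answers whose citations include every expected label."""
    if not answers:
        return 0.0
    hits = sum(1 for a in answers if gold[a.question_id] <= a.cited)
    return hits / len(answers)

# Toy inputs, invented purely to exercise the function; not measurements.
gold = {"q1": frozenset({"sec:2.1"}), "q2": frozenset({"eq:3.4"})}
run_a = [Answer("q1", frozenset({"sec:2.1"})),
         Answer("q2", frozenset({"eq:3.4", "sec:2.1"}))]
run_b = [Answer("q1", frozenset({"sec:2.1"})),
         Answer("q2", frozenset())]

score_a = citation_accuracy(run_a, gold)
score_b = citation_accuracy(run_b, gold)
```

In an actual comparison, one run would hold answers produced from preprocessed LaTeX chunks and the other answers produced from PDF-extracted text, with the same questions and the same gold labels for both.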

Figures

Figures reproduced from arXiv: 2605.22923 by Tom Verhoeff.

**Figure 1.** Major stages of the LaTeX RAG preprocessing pipeline. Input and output files are shown with a gray background. 1. Read the main LaTeX file. 2. Recursively follow \input and \include commands. 3. Read the main .aux file and recursively follow \@input entries. 4. Parse \newlabel entries to build a label table. 5. Load optional YAML annotations. 6. Convert selected LaTeX structures to Markdown. 7. Resolve \re…
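Stages 3 and 4 of the caption can be sketched as follows: read the main `.aux` file, follow `\@input` entries recursively, and collect `\newlabel` entries into a label table. The function name `build_label_table`, the `read_file` callback, and the in-memory sample files are hypothetical. The regular expression assumes that the printed number and the page contain no nested braces, which holds for plain entries such as `\newlabel{key}{{2.3}{7}}` and for the longer five-group form written by hyperref, but not for every package.

```python
import re

# Captures the label key plus the first two brace groups: printed number and page.
# Assumption: neither the number nor the page contains nested braces.
NEWLABEL = re.compile(r"\\newlabel\{([^}]*)\}\{\{([^}]*)\}\{([^}]*)\}")
AUX_INPUT = re.compile(r"\\@input\{([^}]*)\}")

def build_label_table(main_aux, read_file):
    """Collect labels from main_aux and every .aux file reachable via \\@input."""
    labels, pending, seen = {}, [main_aux], set()
    while pending:
        name = pending.pop()
        if name in seen:
            continue  # guard against files that reference each other
        seen.add(name)
        text = read_file(name)
        for key, number, page in NEWLABEL.findall(text):
            labels[key] = {"number": number, "page": page}
        pending.extend(AUX_INPUT.findall(text))
    return labels

# In-memory stand-ins for files on disk (invented content).
FILES = {
    "main.aux": r"""\relax
\@input{ch1.aux}
\newlabel{sec:intro}{{1}{1}}
""",
    "ch1.aux": r"""\relax
\newlabel{eq:split}{{2.3}{7}}
\newlabel{fig:split-diagram}{{2}{7}{Commuting diagram}{figure.2}{}}
""",
}

labels = build_label_table("main.aux", FILES.__getitem__)
```

Passing the file reader as a callback keeps the sketch independent of the file system; with real files, a small function that opens and reads the named path would take the place of `FILES.__getitem__`.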

**Figure 2.** Commuting diagram for the split combinator.

**Figure 3.** Transformation of a LaTeX figure environment (left) into a RAG chunk (right). The \AIDescription block (shaded, left) is a no-op during typesetting but becomes the semantic description in the embedding text (shaded, right). \begin{figure} ... \caption{Commuting diagram for ...} \label{fig:split-diagram} \AIDescription{ The figure shows two functions f:X to A and g:X to B combined into a single function fro…
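The transformation in the caption can be sketched as a brace-balanced extraction of `\caption`, `\label`, and `\AIDescription` from a figure environment. The helper names, the chunk layout, and the sample figure are hypothetical, and the sample deliberately uses placeholder text rather than the paper's own description. The sketch assumes the opening brace follows the command name directly, with no intervening whitespace. One way such a no-op macro could be defined for typesetting is `\newcommand{\AIDescription}[1]{}`; the definition actually used by the author is not shown here.

```python
def braced_argument(text, command):
    """Return the brace-balanced argument right after `command`, or None."""
    start = text.find(command + "{")
    if start < 0:
        return None
    begin = start + len(command) + 1  # first character inside the braces
    depth, pos = 0, begin - 1
    while pos < len(text):
        char = text[pos]
        if char == "\\":
            pos += 2  # skip escaped characters such as \{ and \}
            continue
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return text[begin:pos].strip()
        pos += 1
    return None  # unbalanced braces

def figure_chunk(figure_source, number):
    """Turn one figure environment into a chunk dict (assumed field layout)."""
    caption = braced_argument(figure_source, r"\caption")
    description = braced_argument(figure_source, r"\AIDescription")
    markdown = f"**Figure {number}.** {caption}"
    parts = [markdown] + ([description] if description else [])
    return {
        "id": braced_argument(figure_source, r"\label"),
        "kind": "figure",
        "markdown": markdown,
        "embedding_text": "\n\n".join(parts),
    }

# Invented sample figure with placeholder text.
FIGURE = r"""\begin{figure}
  \includegraphics{example}
  \caption{An example diagram}
  \label{fig:example}
  \AIDescription{A placeholder description with nested {braces} and an escaped \} brace.}
\end{figure}"""

chunk = figure_chunk(FIGURE, 1)
```

Keeping the description out of `markdown` and only in `embedding_text` mirrors the caption's point that the annotation is invisible in the typeset output but available to the embedding model.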

Original abstract

Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model context before answering. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper describes a LaTeX preprocessing pipeline for RAG but supplies no experiments or results to back its claims.

The main point here is a description of how to turn LaTeX source plus auxiliaries into Markdown and JSONL chunks for RAG, with the idea that this keeps structural details like cross-references, macros, and section labels that PDF extraction often drops. That is the core contribution: a focused engineering pipeline scoped to technical documents such as textbooks and exercises. It correctly flags the information loss that happens when going through PDF and suggests using author annotations where needed to identify exercises or resolve references. That part is straightforward and practical for anyone already working on document ingestion for vector databases.

The approach itself does not introduce new algorithms or theory; it adapts standard chunking and metadata handling to LaTeX specifics. The absence of any evaluation data, error rates, or even a small retrieval test is the clear limitation. Without that, it is impossible to judge whether the preprocessing actually reduces hallucinations or improves answer quality over PDF baselines, and the assumption that macros and cross-references can be resolved reliably stays untested. The paper stays within its stated scope as a method sketch rather than a validated result, so there is no internal contradiction or overclaim in the text.

This is mainly useful to practitioners building RAG systems for STEM education materials who need concrete ideas for handling LaTeX. A reader already familiar with RAG pipelines might pick up one or two implementation tips but will not find new evidence or benchmarks. It does not rise to the level of a paper that needs formal refereeing in its current state; the lack of any empirical grounding makes it closer to a technical note.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a preprocessing pipeline that transforms LaTeX source files (including auxiliaries and annotations) into Markdown and JSONL chunks for use in RAG systems. It posits that LaTeX preserves more structural and semantic information than PDF for mathematical and technical documents.

Significance. If validated, this approach could meaningfully improve the grounding of LLM responses for technical content by leveraging LaTeX's inherent structure. The paper's strength lies in its explicit scoping as an engineering contribution rather than an untested theoretical claim.

major comments (1)

[Abstract] The assertion that LaTeX source 'can be a better starting point than a PDF' because it contains structural information 'often lost or distorted in PDF extraction' is presented without any empirical results, comparative metrics, error analysis, or case studies to support the superiority claim or demonstrate the pipeline's effectiveness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for careful scoping in the abstract. We agree that the current wording asserts a comparative advantage without supporting evidence and will revise the abstract accordingly to align with the paper's engineering focus.

Referee: [Abstract] The assertion that LaTeX source 'can be a better starting point than a PDF' because it contains structural information 'often lost or distorted in PDF extraction' is presented without any empirical results, comparative metrics, error analysis, or case studies to support the superiority claim or demonstrate the pipeline's effectiveness.

Authors: We acknowledge that the abstract presents an unsupported comparative claim. The manuscript is explicitly positioned as an engineering contribution describing a preprocessing pipeline rather than an empirical evaluation. We will revise the abstract to remove the assertion of superiority and instead frame the structural advantages of LaTeX as a motivating rationale for the pipeline, without implying proven outperformance over PDF extraction.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents a methodological description of a LaTeX preprocessing pipeline for RAG without any derivations, equations, fitted parameters, predictions, or self-citations that serve as load-bearing premises. The central claim—that LaTeX preserves structural information better than PDF extraction—is an engineering observation, not a result derived from prior inputs or self-referential definitions. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a methods paper describing a software preprocessing pipeline with no mathematical claims, fitted parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5692 in / 959 out tokens · 19886 ms · 2026-05-25T02:09:53.517847+00:00 · methodology

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

[1] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.

[2] J. MacFarlane. Pandoc: A universal document converter. https://pandoc.org, 2006.

[3] B. R. Miller. LaTeXML: A LaTeX to XML/HTML/MathML converter. https://math.nist.gov/~BMiller/LaTeXML/, 2010.

[4] T. Verhoeff. latex-rag-preprocessor: Preprocessing LaTeX source for retrieval-augmented generation. https://gitlab.tue.nl/t-verhoeff-software/latex-rag-preprocessor, 2026.

[5] Chroma. ChromaDB: The AI-native open-source embedding database. https://www.trychroma.com, 2022.

[6] Anthropic. Claude Code: An agentic coding tool. https://claude.ai/code, 2025.

A Author Instructions for AI-Friendly LaTeX. This appendix provides practical guidance for authors who want their LaTeX source to work well with the preprocessor. The instructions below cover the in-source macro approach. For the external YAML approach and a discussion of when to ...

[7] Read chunks.jsonl line by line.

[8] For each chunk, send the embedding_text field to an embedding model to obtain a vector.

[9] Store the vector together with the chunk's id, kind, markdown, heading_path, and remaining metadata in the vector database.

The repository includes a ready-to-run script for ChromaDB [5] using local sentence-transformer embeddings (no API key required):

    pip install chromadb sentence-transformers
    python3 examples/ingest_chromadb.py rag_out/chunks.jsonl

The script is idempo...
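The three ingestion steps captured in the last extracted entries (read `chunks.jsonl` line by line, embed each chunk's `embedding_text`, store the vector with the chunk's metadata) can be sketched without external dependencies. This is not the repository's `ingest_chromadb.py` script: the hashed bag-of-words `toy_embed` function is a stand-in for a real embedding model, a plain dictionary is a stand-in for ChromaDB, and the decision to leave `embedding_text` out of the stored metadata is an assumption.

```python
import hashlib
import io
import json
import math

def toy_embed(text, dim=64):
    """Stand-in for a real embedding model: hashed bag of words, L2-normalised."""
    vector = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

def ingest(jsonl_stream, embed, store):
    """Read each JSONL line, embed `embedding_text`, store vector plus metadata."""
    for line in jsonl_stream:
        if not line.strip():
            continue  # tolerate blank lines
        chunk = json.loads(line)
        vector = embed(chunk["embedding_text"])
        metadata = {k: v for k, v in chunk.items() if k != "embedding_text"}
        # Keying by id means a second run overwrites entries instead of duplicating.
        store[chunk["id"]] = {"vector": vector, "metadata": metadata}
    return store

# Invented sample records using the field names listed in the steps.
SAMPLE = "\n".join(json.dumps(c) for c in [
    {"id": "c1", "kind": "definition", "heading_path": ["Intro"],
     "markdown": "A definition.", "embedding_text": "Intro\nA definition."},
    {"id": "c2", "kind": "exercise", "heading_path": ["Intro"],
     "markdown": "An exercise.", "embedding_text": "Intro\nAn exercise."},
])

store = ingest(io.StringIO(SAMPLE), toy_embed, {})
```

With a real vector database, the dictionary assignment would be replaced by that database's insert or upsert call, and `toy_embed` by a call to the chosen embedding model.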
