AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation
Pith reviewed 2026-05-25 02:09 UTC · model grok-4.3
The pith
LaTeX source, after resolving references and expanding macros, supplies a richer knowledge base for retrieval-augmented generation on mathematical and technical documents than PDF text does.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LaTeX source can serve as a superior starting point for RAG on mathematical and technical material once cross-references are resolved, custom macros are interpreted, exercises are identified, and the results are emitted as Markdown and JSONL chunks suitable for indexing; this preserves information that PDF extraction typically discards.
What carries the argument
A focused preprocessing pipeline that ingests LaTeX source plus its compiled auxiliary files and optional author annotations to emit Markdown and JSONL chunks.
If this is right
- Retrieval systems can index resolved cross-references directly, allowing answers to point accurately to labeled equations or theorems.
- Custom macros expand into readable text, so definitions and notation remain consistent across retrieved fragments.
- Exercises and examples become explicitly tagged, enabling targeted retrieval for practice problems rather than narrative text.
- Author annotations, when present, add semantic metadata that improves chunk relevance without requiring post-hoc parsing.
Where Pith is reading between the lines
- The same pipeline could be wrapped around existing LaTeX build systems so that every compiled PDF is accompanied by an up-to-date set of AI-ready chunks.
- Textbooks already distributed only as source could be turned into public RAG indexes with minimal additional author effort.
- The approach suggests a general principle: any markup language that encodes explicit structure may outperform its rendered form as a retrieval source once preprocessed.
Load-bearing premise
Cross-references can be resolved, custom macros interpreted, and exercises identified reliably through preprocessing without substantial information loss or heavy reliance on author-supplied metadata.
What would settle it
Run the same set of technical questions against a RAG system once using preprocessed LaTeX chunks and once using PDF-extracted text from an identical textbook; measure the fraction of answers that correctly cite the right sections and resolve references.
Figures
read the original abstract
Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model context before answering. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a preprocessing pipeline that transforms LaTeX source files (including auxiliaries and annotations) into Markdown and JSONL chunks for use in RAG systems. It posits that LaTeX preserves more structural and semantic information than PDF for mathematical and technical documents.
Significance. If validated, this approach could meaningfully improve the grounding of LLM responses for technical content by leveraging LaTeX's inherent structure. The paper's strength lies in its explicit scoping as an engineering contribution rather than an untested theoretical claim.
major comments (1)
- [Abstract] Abstract: The assertion that LaTeX source 'can be a better starting point than a PDF' because it contains structural information 'often lost or distorted in PDF extraction' is presented without any empirical results, comparative metrics, error analysis, or case studies to support the superiority claim or demonstrate the pipeline's effectiveness.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for careful scoping in the abstract. We agree that the current wording asserts a comparative advantage without supporting evidence and will revise the abstract accordingly to align with the paper's engineering focus.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that LaTeX source 'can be a better starting point than a PDF' because it contains structural information 'often lost or distorted in PDF extraction' is presented without any empirical results, comparative metrics, error analysis, or case studies to support the superiority claim or demonstrate the pipeline's effectiveness.
Authors: We acknowledge that the abstract presents an unsupported comparative claim. The manuscript is explicitly positioned as an engineering contribution describing a preprocessing pipeline rather than an empirical evaluation. We will revise the abstract to remove the assertion of superiority and instead frame the structural advantages of LaTeX as a motivating rationale for the pipeline, without implying proven outperformance over PDF extraction. revision: yes
Circularity Check
No significant circularity
full rationale
The paper presents a methodological description of a LaTeX preprocessing pipeline for RAG without any derivations, equations, fitted parameters, predictions, or self-citations that serve as load-bearing premises. The central claim—that LaTeX preserves structural information better than PDF extraction—is an engineering observation, not a result derived from prior inputs or self-referential definitions. No steps match the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.Yih, T.Rocktäschel, S.Riedel, andD.Kiela.Retrieval-augmentedgenerationforknowledge- intensive NLP tasks. InAdvances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020
work page 2020
-
[2]
J. MacFarlane. Pandoc: A universal document converter.https://pandoc.org, 2006
work page 2006
-
[3]
B. R. Miller. LaTeXML: A LATEX to XML/HTML/MathML converter.https://math.nist. gov/~BMiller/LaTeXML/, 2010
work page 2010
-
[4]
T. Verhoeff.latex-rag-preprocessor: Preprocessing LATEX source for retrieval-augmented generation.https://gitlab.tue.nl/t-verhoeff-software/latex-rag-preprocessor, 2026
work page 2026
-
[5]
ChromaDB: The AI-native open-source embedding database.https://www
Chroma. ChromaDB: The AI-native open-source embedding database.https://www. trychroma.com, 2022
work page 2022
-
[6]
Anthropic. Claude Code: An agentic coding tool.https://claude.ai/code, 2025. A Author Instructions for AI-Friendly LATEX This appendix provides practical guidance for authors who want their LATEX source to work well with the preprocessor. The instructions below cover the in-source macro approach. For the external YAML approach and a discussion of when to ...
work page 2025
-
[7]
Readchunks.jsonlline by line
-
[8]
For each chunk, send theembedding_textfield to an embedding model to obtain a vector
-
[9]
Store thevector together withthe chunk’sid,kind,markdown,heading_path, and remaining metadata in the vector database. Therepositoryincludesaready-to-runscriptforChromaDB[5]usinglocalsentence-transformer embeddings (no API key required): pip install chromadb sentence-transformers python3 examples/ingest_chromadb.py rag_out/chunks.jsonl The script is idempo...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.