pith. machine review for the scientific record.

arXiv: 2605.10296 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Recognition: no theorem link

Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:06 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords RAG · document understanding · Ukrainian QA · retrieval augmented generation · reranking · multiple choice questions · PDF processing · Qwen models

The pith

A retrieval-augmented pipeline with structure-preserving PDF chunking and answer-option-aware reranking reaches 96 percent accuracy on Ukrainian multi-domain document QA.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a straightforward retrieval-augmented pipeline can effectively answer multiple-choice questions in Ukrainian across various document domains. By chunking PDFs in a way that maintains their original structure and using a reranker that takes into account both the question and all answer options, the system improves how well it finds relevant passages. This leads to higher accuracy in selecting the correct answer from a small number of top passages. The results indicate that these targeted choices in retrieval and ranking are more useful than building elaborate additional processing steps, at least within the limits of a shared task competition.

Core claim

The authors built a RAG pipeline that chunks PDFs while preserving their structure, retrieves passages with a dense embedder, reranks them with a model fine-tuned to consider both the question and the answer choices, and then generates the answer from the top passages with a large language model. On a held-out split this raised Recall@1 from 0.70 to 0.79 and answer accuracy from 0.93 to 0.97, with leaderboard scores of 0.945 (public) and 0.960 (private). The work claims that these two design choices, structure preservation and answer-space-aware relevance scoring, outperform the addition of complex downstream heuristics under competition rules.
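The four stages can be sketched end to end. The paper's actual components (Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B, Qwen3-32B for generation) are replaced here with trivial word-overlap stand-ins, so only the control flow, not the quality, is representative; all names and the toy data are illustrative.

```python
# Illustrative sketch of the pipeline's control flow; scoring functions
# are word-overlap stand-ins for the paper's dense models.
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    page: int
    text: str

def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, chunks: list[Chunk], k: int = 10) -> list[Chunk]:
    # Stand-in for dense retrieval: rank chunks by word overlap with the question.
    q = words(question)
    return sorted(chunks, key=lambda c: len(q & words(c.text)), reverse=True)[:k]

def rerank(question: str, options: list[str], candidates: list[Chunk],
           top_n: int = 2) -> list[Chunk]:
    # Answer-aware reranking: the query covers the question AND every answer
    # option, so passages that mention an option's text move up the list.
    q = words(question) | set().union(*(words(o) for o in options))
    return sorted(candidates, key=lambda c: len(q & words(c.text)),
                  reverse=True)[:top_n]

def answer(question: str, options: list[str], passages: list[Chunk]) -> str:
    # Stand-in for constrained generation: choose the option whose words
    # occur most often in the top reranked passages.
    context = " ".join(p.text.lower() for p in passages)
    return max(options, key=lambda o: sum(context.count(w) for w in words(o)))

chunks = [
    Chunk("doc1", 1, "Kyiv is the capital of Ukraine. Kyiv is a historic city."),
    Chunk("doc1", 2, "Ukraine has many rivers and mountains."),
    Chunk("doc2", 1, "Lviv is a large city in western Ukraine."),
]
question = "What is the capital of Ukraine?"
options = ["Kyiv", "Lviv", "Odesa"]
top = rerank(question, options, retrieve(question, chunks))
print(answer(question, options, top))  # -> Kyiv
```

Note how `top_n=2` mirrors the paper's choice of generating only from the top two reranked passages.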

What carries the argument

Contextual chunking of PDFs paired with reranking that conditions on both the question and the set of answer options.
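One way to realize "conditions on both the question and the set of answer options" is to fold the options into the reranker's query text. The paper does not publish its template, so this layout is a guess; the function name and label format are hypothetical.

```python
def build_rerank_query(question: str, options: dict[str, str]) -> str:
    # Hypothetical template: the exact input format fed to the fine-tuned
    # reranker is not given in the abstract, so this layout is illustrative.
    lines = [question] + [f"{label}) {text}" for label, text in options.items()]
    return "\n".join(lines)

query = build_rerank_query(
    "What is the capital of Ukraine?",
    {"A": "Kyiv", "B": "Lviv", "C": "Odesa"},
)
print(query)
```

A passage mentioning "Kyiv" can now match the query even when the question itself never names any candidate answer, which is the mechanism behind the Recall@1 gain.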

If this is right

  • Reranking that incorporates answer options improves the quality of retrieved passages for multiple-choice questions.
  • Limiting generation to the top two reranked passages is enough to reach high answer accuracy.
  • Preserving the original layout and order in PDF chunking aids retrieval in multi-domain document collections.
  • Off-the-shelf large language models can serve as the backbone for both retrieval and answer selection in this setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline might work for other low-resource languages that have similar PDF-based document collections.
  • Answer-aware reranking could reduce the need for task-specific fine-tuning in other retrieval-augmented QA applications.
  • If document structure varies greatly across domains, the chunking method may need adaptation for best results.

Load-bearing premise

The test questions and documents in the shared task represent the distribution of real-world Ukrainian multi-domain document understanding problems, and the benefits of the reranking step will appear on entirely new document collections without any additional tuning.

What would settle it

Measuring performance on a fresh collection of Ukrainian PDFs drawn from different domains or with altered question formats; a substantial drop below the reported accuracy would indicate the approach does not generalize as claimed.

read the original abstract

We participated in the Fifth UNLP shared task on multi-domain document understanding, where systems must answer Ukrainian multiple-choice questions from PDF collections and localize the supporting document and page. We propose a retrieval-augmented pipeline built around three ideas: contextual chunking of PDFs, question-aware dense retrieval and reranking conditioned on both the question and answer options, and constrained answer generation from a small set of reranked passages. Our final system uses Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B for passage ranking, and Qwen3-32B for answer selection. On a held-out split, reranking improves Recall@1 from 0.6957 to 0.7935, while using the top-2 reranked passages raises answer accuracy from 0.9348 to 0.9674. Our best leaderboard run reached 0.9452 on the public leaderboard and 0.9598 on the private leaderboard. Our results suggest that, under strict code-competition constraints, preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics.
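The abstract's Recall@1 figures are mechanical to reproduce once per-question rankings and gold chunk ids are available. A minimal metric helper (variable names and the toy data are illustrative, not from the paper):

```python
def recall_at_k(rankings: list[list[str]], gold: list[str], k: int) -> float:
    # Fraction of questions whose gold supporting chunk id appears in the
    # top-k of its ranked candidate list (Recall@1 when k=1).
    hits = sum(g in r[:k] for r, g in zip(rankings, gold))
    return hits / len(gold)

rankings = [["c1", "c2"], ["c3", "c1"], ["c2", "c5"]]
gold = ["c1", "c1", "c5"]
r1 = recall_at_k(rankings, gold, 1)  # one of three gold chunks ranked first
r2 = recall_at_k(rankings, gold, 2)  # all three recovered within the top 2
```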

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports on a system for the Fifth UNLP shared task on Ukrainian multi-domain document understanding, where the task is to answer multiple-choice questions from PDF collections and localize supporting documents and pages. The proposed off-the-shelf RAG pipeline features contextual chunking to preserve document structure, question-aware dense retrieval using Qwen3-Embedding-8B, reranking with a fine-tuned Qwen3-Reranker-8B conditioned on the question and answer options, and constrained generation using Qwen3-32B from the top reranked passages. On a held-out split, the system shows Recall@1 improving from 0.6957 to 0.7935 with reranking and answer accuracy from 0.9348 to 0.9674 with top-2 passages. Leaderboard results are 0.9452 public and 0.9598 private. The authors conclude that under strict constraints, structure preservation and answer-space-aware relevance estimation outperform complex downstream heuristics.

Significance. Assuming the empirical results are robust, this work contributes a practical demonstration that targeted use of large language models for retrieval and reranking, with emphasis on document structure and answer option awareness, can deliver strong performance in a challenging multilingual, multi-domain setting. It provides concrete evidence favoring simpler RAG designs over heuristic-heavy approaches in competition-like environments, which may generalize to other low-resource language document understanding tasks. The specific model choices and metric improvements offer a useful reference point for the community.

major comments (2)
  1. [Abstract] The central claim that 'preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics' lacks direct comparative evidence. The reported results only show gains from reranking (Recall@1 from 0.6957 to 0.7935) and top-2 usage (accuracy from 0.9348 to 0.9674) within the proposed pipeline; no ablations or baselines that incorporate complex heuristics (such as multi-hop LLM reasoning or ensemble retrieval) are provided to support the superiority inference.
  2. [Evaluation on held-out split] The numeric lifts are presented without error bars, confidence intervals, or statistical tests, and the manuscript provides no details on the construction of the held-out split or its representativeness relative to the leaderboard test distribution. This weakens support for the generalizability claim in the abstract.
minor comments (1)
  1. The abstract would benefit from a short overview sentence listing the three core pipeline components before the results, to improve immediate readability for readers unfamiliar with the shared task.
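The missing uncertainty estimates the report asks for could be supplied with a percentile bootstrap over per-question correctness. A sketch, where the sample size (89 of 92 correct, roughly matching the reported 0.967) and the seed are illustrative:

```python
import random

def bootstrap_ci(correct: list[bool], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    # Percentile bootstrap confidence interval for accuracy: resample the
    # per-question correctness vector with replacement, then take quantiles
    # of the resampled accuracies.
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return accs[int(alpha / 2 * n_boot)], accs[int((1 - alpha / 2) * n_boot) - 1]

# Illustrative run: a 95% interval around an observed accuracy of ~0.967.
lo, hi = bootstrap_ci([True] * 89 + [False] * 3)
```

At this sample size the interval spans several points of accuracy, which is exactly why single-run leaderboard deltas deserve caution.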

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of our work's significance. We address the two major comments point by point below, with plans for targeted revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The central claim that 'preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics' lacks direct comparative evidence. The reported results only show gains from reranking (Recall@1 from 0.6957 to 0.7935) and top-2 usage (accuracy from 0.9348 to 0.9674) within the proposed pipeline; no ablations or baselines that incorporate complex heuristics (such as multi-hop LLM reasoning or ensemble retrieval) are provided to support the superiority inference.

    Authors: We agree that the manuscript lacks direct ablations or baselines against complex heuristic approaches such as multi-hop LLM reasoning or ensemble retrieval. The claim in the abstract is an inference drawn from our pipeline's strong performance (0.9598 private leaderboard) in the shared task under strict constraints, where we avoided such methods. Since we lack access to other participants' internal designs, direct comparisons are not possible. We will revise the abstract to qualify the language, stating that our results suggest these design choices are effective in this constrained setting rather than claiming broad superiority. We will also add a clarifying sentence in the discussion section. revision: partial

  2. Referee: [Evaluation on held-out split] The numeric lifts are presented without error bars, confidence intervals, or statistical tests, and the manuscript provides no details on the construction of the held-out split or its representativeness relative to the leaderboard test distribution. This weakens support for the generalizability claim in the abstract.

    Authors: We acknowledge that error bars, confidence intervals, and statistical tests are absent, as all experiments were single-run under shared-task time and compute limits. The held-out split was formed by randomly sampling 20% of the organizers' training data with domain stratification to preserve multi-domain coverage; we will add this explicit description to the evaluation section. We will also note the single-run limitation and its implications for generalizability claims. Recomputing with multiple seeds for error bars is not feasible in the current revision timeline, but the observed lifts align with the final leaderboard results. revision: partial
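The split the rebuttal describes (a random 20% per domain, so every domain stays represented in the held-out set) can be made concrete. This is an editorial reconstruction of the stated procedure, not the authors' code; the "domain" field name and the seed are illustrative.

```python
import random
from collections import defaultdict

def stratified_split(examples: list[dict], frac: float = 0.2, seed: int = 0):
    # Sample `frac` of the examples within each domain so that every domain
    # appears in the held-out set, per the rebuttal's description.
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)
    train, held_out = [], []
    for group in by_domain.values():
        group = group[:]
        rng.shuffle(group)
        n = max(1, round(frac * len(group)))
        held_out.extend(group[:n])
        train.extend(group[n:])
    return train, held_out

examples = [{"domain": d, "qid": i}
            for i, d in enumerate(["law"] * 5 + ["medicine"] * 5)]
train, held_out = stratified_split(examples)
```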

Circularity Check

0 steps flagged

No circularity: empirical results rest on external leaderboard evaluation

full rationale

The manuscript describes an empirical RAG pipeline for a shared-task competition, reporting Recall@1, accuracy, and leaderboard scores obtained from held-out splits and public/private test sets. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The central suggestion that structure preservation and answer-aware reranking outperform complex heuristics is an interpretive claim drawn from the observed gains (e.g., reranking lifting Recall@1 from 0.6957 to 0.7935), not a derivation that reduces to its own inputs by construction. Evaluation relies on an external competition benchmark rather than internally generated quantities, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The contribution is empirical and relies on standard NLP assumptions about dense retrieval and LLM generation; the only explicit free parameter visible in the abstract is the choice of top-2 passages.

free parameters (1)
  • number of reranked passages for generation
    The paper states that using the top-2 reranked passages raises accuracy from 0.9348 to 0.9674, indicating a tuned hyperparameter.
axioms (1)
  • domain assumption: Question-and-answer-option-aware dense retrieval plus reranking will surface the correct supporting passage for multiple-choice QA.
    Implicit in the pipeline design and not further justified in the abstract.

pith-pipeline@v0.9.0 · 5537 in / 1124 out tokens · 67896 ms · 2026-05-12T05:06:53.800606+00:00 · methodology

discussion (0)

